ABSTRACT


This book provides a structured treatment of the key principles and techniques for enabling 
efficient processing of deep neural networks (DNNs). DNNs are currently widely used for 
many artificial intelligence (AI) applications, including computer vision, speech recognition, 
and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the
cost of high computational complexity. Therefore, techniques that enable efficient processing
of deep neural networks to improve key metrics—such as energy-efficiency, throughput, and 
latency—without sacrificing accuracy or increasing hardware costs are critical to enabling the 
wide deployment of DNNs in AI systems. 

The book includes background on DNN processing; a description and taxonomy of hard- 
ware architectural approaches for designing DNN accelerators; key metrics for evaluating and 
comparing different designs; features of DNN processing that are amenable to hardware/algo- 
rithm co-design to improve energy efficiency and throughput; and opportunities for applying 
new technologies. Readers will find a structured introduction to the field as well as formalization 
and organization of key concepts from contemporary work that provide insights that may spark 
new ideas. 


KEYWORDS 


deep learning, neural network, deep neural networks (DNN), convolutional neural 
networks (CNN), artificial intelligence (AI), efficient processing, accelerator ar- 
chitecture, hardware/software co-design, hardware/algorithm co-design, domain- 
Preface


Deep neural networks (DNNs) have become extraordinarily popular; however, they come at 
the cost of high computational complexity. As a result, there has been tremendous interest in 
enabling efficient processing of DNNs. The challenge of DNN acceleration is threefold: 


* to achieve high performance and efficiency,


* to provide sufficient flexibility to cater to a wide and rapidly changing range of workloads, 
and 


* to integrate well into existing software frameworks. 


In order to understand the current state of the art in addressing this challenge, this book aims
to provide an overview of DNNs, the various tools for understanding their behavior, and the 
techniques being explored to efficiently accelerate their computation. It aims to explain founda- 
tional concepts and highlight key design considerations when building hardware for processing 
DNNs rather than trying to cover all possible design configurations, as this is not feasible given 
the fast pace of the field (see Figure 1). It is targeted at researchers and practitioners who are 
familiar with computer architecture and are interested in how to efficiently process DNNs or
how to design DNN models that can be efficiently processed. We hope that this book will pro- 
vide a structured introduction to readers who are new to the field, while also formalizing and 
organizing key concepts to provide insights that may spark new ideas for those who are already 
in the field. 


Organization 
This book is organized into three modules that each consist of several chapters. The first module
aims to provide an overall background to the field of DNNs and insight on characteristics of the
DNN workload.


* Chapter 1 provides background on the context of why DNNs are important, their history,
and their applications. 


* Chapter 2 gives an overview of the basic components of DNNs and popular DNN models
currently in use. It also describes the various resources used for DNN research and
development. This includes discussion of the various software frameworks and the public 
datasets that are used for training and evaluation. 


Figure 1: It has been observed that the number of ML publications is growing exponentially at a
faster rate than Moore’s law! (Figure from [1].)


The second module focuses on the design of hardware for processing DNNs. It discusses
various architecture design decisions depending on the degree of customization (from
general-purpose platforms to full custom hardware) and design considerations when mapping
the DNN workloads onto these architectures. Both temporal and spatial architectures are considered.


* Chapter 3 describes the key metrics that should be considered when designing or
comparing various DNN accelerators.


* Chapter 4 describes how DNN kernels can be processed, with a focus on temporal
architectures such as CPUs and GPUs. To achieve greater efficiency, such architectures
generally have a cache hierarchy and coarser-grained computational capabilities, e.g., vector
instructions, making the resulting computation more efficient. Frequently for such ar- 
chitectures, DNN processing can be transformed into a matrix multiplication, which has 
many optimization opportunities. This chapter also discusses various software and hard- 
ware optimizations used to accelerate DNN computations on these platforms without 
impacting application accuracy. 


* Chapter 5 describes the design of specialized hardware for DNN processing, with a focus
on spatial architectures. It highlights the processing order and resulting data movement
in the hardware used to process a DNN and the relationship to a loop nest representation
of a DNN. The order of the loops in the loop nest is referred to as the dataflow, and it
determines how often each piece of data needs to be moved. The limits of the loops in
the loop nest describe how to break the DNN workload into smaller pieces, referred to as
tiling/blocking, to account for the limited storage capacity at different levels of the memory
hierarchy.


* Chapter 6 presents the process of mapping a DNN workload onto a DNN accelerator. It
describes the steps required to find an optimized mapping, including enumerating all legal 
mappings and searching those mappings by employing models that project throughput and 
energy efficiency. 
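The loop nest, dataflow, and tiling ideas named in the Chapter 5 summary can be previewed with a small sketch. It is only an illustration under stated assumptions: a 1-D convolution stands in for a DNN layer, and the function names, data, and tile size are hypothetical rather than taken from the book. The three functions compute the same outputs; they differ only in loop order (which operand stays in place) and in how the output loop is broken into tiles.

```python
def conv1d_output_outer(inputs, weights):
    """Loop nest with the output index outermost.

    Each output's partial sum stays in one place while it is accumulated.
    """
    num_outputs = len(inputs) - len(weights) + 1
    outputs = [0] * num_outputs
    for q in range(num_outputs):
        for s in range(len(weights)):
            outputs[q] += inputs[q + s] * weights[s]
    return outputs


def conv1d_weight_outer(inputs, weights):
    """Same computation with the two loops interchanged.

    Each weight is now held fixed while it is applied to every output.
    """
    num_outputs = len(inputs) - len(weights) + 1
    outputs = [0] * num_outputs
    for s in range(len(weights)):
        for q in range(num_outputs):
            outputs[q] += inputs[q + s] * weights[s]
    return outputs


def conv1d_tiled(inputs, weights, tile_size):
    """Output loop split into tiles, so only one tile of outputs is live at a time."""
    num_outputs = len(inputs) - len(weights) + 1
    outputs = [0] * num_outputs
    for q_start in range(0, num_outputs, tile_size):
        q_end = min(q_start + tile_size, num_outputs)
        for s in range(len(weights)):
            for q in range(q_start, q_end):
                outputs[q] += inputs[q + s] * weights[s]
    return outputs


# Hypothetical integer data so the three variants can be compared exactly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]
print(conv1d_output_outer(x, w))
print(conv1d_weight_outer(x, w))
print(conv1d_tiled(x, w, tile_size=4))
```

All three calls print the same list; what changes between them is the order in which data is touched, which is the property a dataflow or tiling choice controls.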


The third module discusses how additional improvements in efficiency can be achieved 
either by moving up the stack through the co-design of the algorithms and hardware or down 
the stack by using mixed-signal circuits and new memory or device technology. In the cases
where the algorithm is modified, the impact on accuracy must be carefully evaluated. 


* Chapter 7 describes how reducing the precision of data and computation can result in
increased throughput and energy efficiency. It discusses how to reduce precision using 
quantization and the associated design considerations, including hardware cost and impact 
on accuracy. 


* Chapter 8 describes how exploiting sparsity in DNNs can be used to reduce the footprint
of the data, which provides an opportunity to reduce storage requirements, data move- 
ment, and arithmetic operations. It describes various sources of sparsity and techniques 
to increase sparsity. It then discusses how sparse DNN accelerators can translate sparsity 
into improvements in energy efficiency and throughput. It also presents a new abstract
data representation that can be used to express and obtain insight about the dataflows for 
a variety of sparse DNN accelerators. 


* Chapter 9 describes how to optimize the structure of the DNN models (i.e., the ‘network
architecture’ of the DNN) to improve both throughput and energy efficiency while trying 
to minimize impact on accuracy. It discusses both manual design approaches as well as 
automatic design approaches (i.e., neural architecture search). 


* Chapter 10, on advanced technologies, discusses how mixed-signal circuits and new memory
technologies can be used to bring the compute closer to the data (e.g., processing in
memory) to address the expensive data movement that dominates throughput and energy 
consumption of DNNs. It also briefly discusses the promise of reducing energy consump- 
tion and increasing throughput by performing the computation and communication in the 
optical domain. 


What’s New? 
This book is an extension of a tutorial paper written by the same authors entitled “Efficient
Processing of Deep Neural Networks: A Tutorial and Survey” that appeared in the Proceedings
of the IEEE in 2017 and slides from short courses given at ISCA and MICRO in 2016, 2017, and
2019 (slides available at http://eyeriss.mit.edu/tutorial.html). This book includes recent works
since the publication of the tutorial paper along with a more in-depth treatment of topics such 
as dataflow, mapping, and processing in memory. We also provide updates on the fast-moving 
field of co-design of DNN models and hardware in the areas of reduced precision, sparsity, 
and efficient DNN model design. As part of this effort, we present a new way of thinking about 
sparse representations and give a detailed treatment of how to handle and exploit sparsity. Finally, 
we touch upon recurrent neural networks, autoencoders, and transformers, which we did not
discuss in the tutorial paper. 


Scope of book 

The main goal of this book is to teach the reader how to tackle the computational challenge of 
efficiently processing DNNs rather than how to design DNNs for increased accuracy. As a result, 
this book does not cover training (only touching on it lightly), nor does it cover the theory of 
deep learning or how to design DNN models (though it discusses how to make them efficient) 
or use them for different applications. For these aspects, please refer to other references such as 
Goodfellow’s book [2], Amazon’s book [3], and the Stanford CS231n course notes [4].


Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer 
June 2020 
PART I


Understanding Deep Neural 
Networks 


CHAPTER 1 


Introduction 


Deep neural networks (DNNs) are currently the foundation for many modern artificial
intelligence (AI) applications [5]. Since the breakthrough application of DNNs to speech
recognition [6] and image recognition¹ [7], the number of applications that use DNNs has exploded.
These DNNs are employed in a myriad of applications from self-driving cars [8], to detecting 
cancer [9], to playing complex games [10]. In many of these domains, DNNs are now able 
to exceed human accuracy. The superior accuracy of DNNs comes from their ability to extract 
high-level features from raw sensory data by using statistical learning on a large amount of data 
to obtain an effective representation of an input space. This is different from earlier approaches 
that use hand-crafted features or rules designed by experts. 

The superior accuracy of DNNs, however, comes at the cost of high computational
complexity. To date, general-purpose compute engines, especially graphics processing units (GPUs),
have been the mainstay for much DNN processing. Increasingly, however, in these waning days 
of Moore’s law, there is a recognition that more specialized hardware is needed to keep im- 
proving compute performance and energy efficiency [11]. This is especially true in the domain 
of DNN computations. This book aims to provide an overview of DNNs, the various tools for 
understanding their behavior, and the techniques being explored to efficiently accelerate their 
computation. 


1.1 BACKGROUND ON DEEP NEURAL NETWORKS 


In this section, we describe the position of DNNs in the context of artificial intelligence (AI) 
in general and some of the concepts that motivated the development of DNNs. We will also 
present a brief chronology of the major milestones in the history of DNNs, and some current 
domains to which they are being applied.


1.1.1 ARTIFICIAL INTELLIGENCE AND DEEP NEURAL NETWORKS 


DNNs, also referred to as deep learning, are a part of the broad field of AI. AI is the science and 
engineering of creating intelligent machines that have the ability to achieve goals like humans 
do, according to John McCarthy, the computer scientist who coined the term in the 1950s. The 
relationship of deep learning to the whole of AI is illustrated in Figure 1.1. 


1Image recognition is also commonly referred to as image classification.

Figure 1.1: Deep learning in the context of artificial intelligence.


Within AI is a large sub-field called machine learning, which was defined in 1959 by 
Arthur Samuel [12] as “the field of study that gives computers the ability to learn without being 
explicitly programmed.” That means a single program, once created, will be able to learn how to 
do some intelligent activities outside the notion of programming. This is in contrast to purpose- 
built programs whose behavior is defined by hand-crafted heuristics that explicitly and statically 
define their behavior. 

The advantage of an effective machine learning algorithm is clear. Instead of the laborious 
and hit-or-miss approach of creating a distinct, custom program to solve each individual problem 
in a domain, a single machine learning algorithm simply needs to learn, via a process called 
training, to handle each new problem. 

Within the machine learning field, there is an area that is often referred to as brain- 
inspired computation. Since the brain is currently the best “machine” we know of for learning 
and solving problems, it is a natural place to look for inspiration. Therefore, a brain-inspired 
computation is a program or algorithm that takes some aspects of its basic form or functionality 
from the way the brain works. This is not an attempt to create a brain; rather, the
program aims to emulate some aspects of how we understand the brain to operate.

Although scientists are still exploring the details of how the brain works, it is generally 
believed that the main computational element of the brain is the neuron. There are approximately 
86 billion neurons in the average human brain. The neurons themselves are connected by a num- 
ber of elements entering them, called dendrites, and an element leaving them, called an axon, 
as shown in Figure 1.2. The neuron accepts the signals entering it via the dendrites, performs a 
computation on those signals, and generates a signal on the axon. These input and output
signals are referred to as activations. The axon of one neuron branches out and is connected to the
dendrites of many other neurons. The connection between a branch of the axon and a dendrite
is called a synapse. There are estimated to be $10^{14}$ to $10^{15}$ synapses in the average human brain.

Figure 1.2: Connections to a neuron in the brain. $x_i$, $w_i$, $f(\cdot)$, and $b$ are the activations, weights,
nonlinear function, and bias, respectively. (Figure adapted from [4].)

A key characteristic of the synapse is that it can scale the signal ($x_i$) crossing it, as shown
in Figure 1.2. That scaling factor can be referred to as a weight ($w_i$), and the way the brain is
believed to learn is through changes to the weights associated with the synapses. Thus, different
weights result in different responses to an input. One aspect of learning can be thought of as the 
adjustment of weights in response to a learning stimulus, while the organization (what might be 
thought of as the program) of the brain largely does not change. This characteristic makes the 
brain an excellent inspiration for a machine-learning-style algorithm. 

Within the brain-inspired computing paradigm, there is a subarea called spiking computing.
In this subarea, inspiration is taken from the fact that communication on the dendrites
and axons takes the form of spike-like pulses and that the information being conveyed is not
based only on a spike’s amplitude. Instead, it also depends on the time the pulse arrives, and the
computation that happens in the neuron is a function not of a single value but of the pulse
width and the timing relationship between different pulses. The IBM TrueNorth project is an example of
work that was inspired by the spiking of the brain [13]. In contrast to spiking computing, another
subarea of brain-inspired computing is called neural networks, which is the focus of this book.²


2Note: Recent work using TrueNorth in a stylized fashion allows it to be used to compute reduced precision neural
networks [14]. These types of neural networks are discussed in Chapter 7.

Figure 1.3: Simple neural network example and terminology. (Figure adapted from [4].)


1.1.2 NEURAL NETWORKS AND DEEP NEURAL NETWORKS 


Neural networks take their inspiration from the notion that a neuron’s computation involves a
weighted sum of the input values. These weighted sums correspond to the value scaling per- 
formed by the synapses and the combining of those values in the neuron. Furthermore, the 
neuron does not directly output that weighted sum because the expressive power of the cascade 
of neurons involving only linear operations is just equal to that of a single neuron, which is very 
limited. Instead, there is a functional operation within the neuron that is performed on the com- 
bined inputs. This operation appears to be a nonlinear function that causes a neuron to generate 
an output only if its combined inputs cross some threshold. Thus, by analogy, neural networks 
apply a nonlinear function to the weighted sum of the input values.³ These nonlinear functions
are inspired by biological functions, but are not meant to emulate the brain. We look at some of 
those nonlinear functions in Section 2.3.3. 
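To see why a cascade of purely linear neurons is no more expressive than a single layer, consider a short worked derivation. The symbols $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, and $b^{(2)}$ (the weights and biases of two consecutive layers in matrix form) are introduced here only for illustration. With no nonlinear function between the layers, the two layers reduce to one layer with weights $W'$ and bias $b'$:

```latex
\begin{aligned}
y &= W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)} \\
  &= \underbrace{W^{(2)} W^{(1)}}_{W'}\, x
     + \underbrace{W^{(2)} b^{(1)} + b^{(2)}}_{b'} \\
  &= W' x + b'
\end{aligned}
```

Inserting a nonlinear function $f(\cdot)$ between the layers prevents this factoring, which is what allows additional layers to add expressive power.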

Figure 1.3a shows a diagram of a three-layer (non-biological) neural network. The neurons 
in the input layer receive some values, compute their weighted sums followed by the nonlinear 
function, and propagate the outputs to the neurons in the middle layer of the network, which 
is also frequently called a “hidden layer.” A neural network can have more than one hidden 
layer, and the outputs from the hidden layers ultimately propagate to the output layer, which 
computes the final outputs of the network to the user. To align brain-inspired terminology with 
neural networks, the outputs of the neurons are often referred to as activations, and the synapses 
are often referred to as weights, as shown in Figure 1.3a. We will use the activation/weight 
nomenclature in this book. 


3Without a nonlinear function, multiple layers could be collapsed into one.

Figure 1.4: Example of image classification using deep neural networks. (Figure adapted
from [15].) Note that the features go from low level to high level as we go deeper into the
network.


Figure 1.3b shows an example of the computation at layer 1: $y_j = f\left(\sum_{i=1}^{4} W_{ij} \times x_i + b_j\right)$,
where $W_{ij}$, $x_i$, and $y_j$ are the weights, input activations, and output activations, respectively, and
$f(\cdot)$ is a nonlinear function described in Section 2.3.3. The bias term $b_j$ is omitted from
Figure 1.3b for simplicity. In this book, we will use the color green to denote weights, blue to denote
activations, and red to denote weighted sums (or partial sums, which are further accumulated to
become the final weighted sums).
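The layer computation can be written out directly as a pair of nested loops. The following is a minimal sketch rather than the book's implementation: the choice of ReLU for the nonlinear function, the layer sizes, and all numeric values are assumptions made for illustration.

```python
def relu(z):
    """Rectified linear unit, used here as an assumed choice for f(.)."""
    return max(0.0, z)


def layer_forward(weights, inputs, biases, f=relu):
    """Compute y_j = f(sum_i W_ij * x_i + b_j) for every output j.

    weights[i][j] holds W_ij, the weight from input i to output j.
    """
    num_inputs = len(inputs)
    num_outputs = len(biases)
    outputs = []
    for j in range(num_outputs):
        weighted_sum = biases[j]
        for i in range(num_inputs):
            # Each product is a partial sum accumulated into the weighted sum.
            weighted_sum += weights[i][j] * inputs[i]
        outputs.append(f(weighted_sum))
    return outputs


# Hypothetical values: 4 input activations feeding 3 output neurons.
x = [1.0, 2.0, 0.0, -1.0]
W = [
    [1.0, -1.0, 0.5],
    [0.5, 0.0, -1.0],
    [2.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
]
b = [0.0, 2.5, 1.0]
y = layer_forward(W, x, b)
print(y)  # [1.0, 0.5, 0.0]
```

The third output is zero because its weighted sum is negative and the assumed nonlinear function clips it, which illustrates a neuron producing no output when its combined inputs fall below a threshold.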

Within the domain of neural networks, there is an area called deep learning, in which the 
neural networks have more than three layers, i.e., more than one hidden layer. Today, the typical 
numbers of network layers used in deep learning range from 5 to more than 1,000. In this
book, we will generally use the terminology deep neural networks (DNNs) to refer to the neural 
networks used in deep learning. 

DNNs are capable of learning high-level features with more complexity and abstraction 
than shallower neural networks. An example that demonstrates this point is using DNNs to 
process visual data, as shown in Figure 1.4. In these applications, pixels of an image are fed 
into the first layer of a DNN, and the outputs of that layer can be interpreted as representing 
the presence of different low-level features in the image, such as lines and edges. In subsequent 
layers, these features are then combined into a measure of the likely presence of higher-level 
features, e.g., lines are combined into shapes, which are further combined into sets of shapes. 
Finally, given all this information, the network provides a probability that these high-level
features comprise a particular object or scene. This deep feature hierarchy enables DNNs to achieve
superior performance in many tasks.

Figure 1.5: Example of an image classification task. The machine learning platform takes in an
image and outputs the class probabilities for a predefined set of classes.


1.2 TRAINING VERSUS INFERENCE 


Since DNNs are an instance of machine learning algorithms, the basic program does not change 
as it learns to perform its given tasks. In the specific case of DNNs, this learning involves de- 
termining the value of the weights (and biases) in the network, and is referred to as training 
the network. Once trained, the program can perform its task by computing the output of the 
network using the weights determined during the training process. Running the program with 
these weights is referred to as inference. 

In this section, we will use image classification, as shown in Figure 1.5, as a driving example 
for training and using a DNN. When we perform inference using a DNN, the input is an image
and the output is a vector of values representing the class probabilities. There is one value for 
each object class, and the class with the highest value indicates the most likely (predicted) class 
of object in the image. The overarching goal for training a DNN is to determine the weights 
that maximize the probability of the correct class and minimize the probabilities of the incorrect 
classes. The correct class is generally known, as it is often defined in the training set. The gap 
between the ideal correct probabilities and the probabilities computed by the DNN based on its 
current weights is referred to as the loss ($L$). Thus, the goal of training DNNs is to find a set of
weights to minimize the average loss over a large training set. 
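The relationship between class probabilities, the predicted class, and the loss can be sketched in a few lines. The text above does not commit to a particular loss function, so the softmax output stage and the cross-entropy loss used here are assumptions chosen because they are common; the scores and the three-class setup are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed output stage)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(probabilities, correct_class):
    """Assumed loss: small when the correct class receives high probability."""
    return -math.log(probabilities[correct_class])


def predict(probabilities):
    """The predicted class is the index with the highest probability."""
    return max(range(len(probabilities)), key=lambda c: probabilities[c])


# Hypothetical three-class example with made-up scores.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted = predict(probs)
loss_if_label_is_0 = cross_entropy_loss(probs, correct_class=0)
loss_if_label_is_2 = cross_entropy_loss(probs, correct_class=2)
print(predicted, loss_if_label_is_0, loss_if_label_is_2)
```

When the labeled class matches the class the network favors, the loss is small; when the label is a class the network considers unlikely, the loss is large. Training adjusts the weights to shrink this gap on average.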

Sidebar: Key steps in training


Here, we will provide a very brief summary of the key steps of training and deploying
a model. For more details, we recommend the reader refer to more comprehensive
references such as [2]. First, we collect a labeled dataset and divide the data into subsets
for training and testing. Second, we use the training set to train a model so that it can
learn the weights for a given task. After achieving adequate accuracy on the training set,
the ultimate quality of the model is determined by how accurately it performs on unseen
data. Therefore, in the third step, we test the trained model by asking it to predict the
labels for a test set that it has never seen before and compare the prediction to the ground
truth labels. Generalization refers to how well the model maintains the accuracy between
training and unseen data. A model that does not generalize well is often said to be
overfitting; this implies that the model is fitting to the noise rather than the underlying
data structure that we would like it to learn. One way to combat overfitting is to
have a large, diverse dataset; it has been shown that accuracy increases logarithmically
as a function of the number of training examples [16]. Section 2.6.3 will discuss various
popular datasets used for training. There are also other mechanisms that help with
generalization, including regularization, which adds constraints to the model during training
such as smoothness, number of parameters, size of the parameters, prior distribution
or structure, or randomness in the training using dropout [17]. Further partitioning the
training set into training and validation sets is another useful tool. Designing a DNN
requires determining (tuning) a large number of hyperparameters such as the size and
shape of a layer or the number of layers. Tuning the hyperparameters based on the test
set may cause overfitting to the test set, which results in a misleading evaluation of the
true performance on unseen data. In this circumstance, the validation set can be used
instead of the test set to mitigate this problem. Finally, if the model performs sufficiently
well on the test set, it can be deployed on unlabeled images.


When training a network, the weights ($w_{ij}$) are usually updated using a hill-climbing
(hill-descending) optimization process called gradient descent. In gradient descent, a weight is
updated by a scaled version of the partial derivative of the loss with respect to the weight (i.e.,
updated to $w_{ij}^{t+1} = w_{ij}^{t} - \alpha \frac{\partial L}{\partial w_{ij}}$, where $\alpha$ is called the learning rate⁴). Note that this gradient
indicates how the weights should change in order to reduce the loss. The process is repeated
iteratively to reduce the overall loss.
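The gradient descent update can be tried on a toy problem. This is a sketch under stated assumptions: the one-weight quadratic loss, the starting point, and the two learning rates are hypothetical and chosen only to show how repeated updates reduce the loss and how the learning rate affects convergence.

```python
def loss(w):
    """Hypothetical loss with a single weight; its minimum is at w = 3."""
    return (w - 3.0) ** 2


def loss_gradient(w):
    """Derivative dL/dw of the hypothetical loss above."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_initial, learning_rate, num_steps):
    """Repeatedly apply w <- w - learning_rate * dL/dw and record each w."""
    w = w_initial
    history = [w]
    for _ in range(num_steps):
        w = w - learning_rate * loss_gradient(w)
        history.append(w)
    return history


small_step = gradient_descent(w_initial=0.0, learning_rate=0.1, num_steps=50)
large_step = gradient_descent(w_initial=0.0, learning_rate=1.1, num_steps=50)

print(loss(small_step[-1]))  # close to zero: the iterates settle near w = 3
print(loss(large_step[-1]))  # very large: every step overshoots the minimum
```

With the smaller learning rate, each step moves the weight a fraction of the way toward the minimum. With the larger one, each step jumps past the minimum by more than the previous error, so the loss grows instead of shrinking.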


An efficient way to compute the partial derivatives that make up the gradient is through a process
called backpropagation. Backpropagation, which is a computation derived from the chain rule of
calculus, operates by passing values backward through the network to compute how the loss is
affected by each weight.


4A large learning rate increases the step size applied at each iteration, which can help speed up the training, but may also
result in overshooting the minimum or cause the optimization to not converge. A small learning rate decreases the step size
applied at each iteration, which slows down the training but increases the likelihood of convergence. There are various methods
to set the learning rate, such as ADAM [18]. Finding the best learning rate is one of the key challenges in training
DNNs.


Figure 1.6: An example of backpropagation through a neural network.

This backpropagation computation is, in fact, very similar in form to the computation used
for inference, as shown in Figure 1.6 [19].⁵ Thus, techniques for efficiently performing inference
can sometimes be useful for performing training. There are, however, some important additional
considerations to note. First, backpropagation requires intermediate outputs of the network to
be preserved for the backward computation; thus, training has increased storage requirements.
Second, because gradients are used for hill-climbing (hill-descending), the precision requirement
for training is generally higher than for inference. Thus, many of the reduced precision techniques
discussed in Chapter 7 are limited to inference only.
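The similarity between the backward and forward computations is easiest to see for a single layer. The sketch below makes simplifying assumptions: the layer is a plain weighted sum with no bias or nonlinear function, and the weights, inputs, and incoming gradients are hypothetical. It computes the gradient of the loss with respect to the weights from the saved forward inputs, and the gradient with respect to the layer inputs as a weighted sum over the weights.

```python
def forward(weights, inputs):
    """Weighted sums y_j = sum_i W_ij * x_i (bias and nonlinearity omitted)."""
    num_outputs = len(weights[0])
    return [
        sum(weights[i][j] * inputs[i] for i in range(len(inputs)))
        for j in range(num_outputs)
    ]


def backward(weights, inputs, grad_outputs):
    """Return (dL/dW, dL/dx) for the layer, given dL/dy for its outputs."""
    num_inputs, num_outputs = len(inputs), len(grad_outputs)
    # Step 1: dL/dW_ij = x_i * dL/dy_j, which needs the saved forward inputs.
    grad_weights = [
        [inputs[i] * grad_outputs[j] for j in range(num_outputs)]
        for i in range(num_inputs)
    ]
    # Step 2: dL/dx_i = sum_j W_ij * dL/dy_j, a weighted sum like inference.
    grad_inputs = [
        sum(weights[i][j] * grad_outputs[j] for j in range(num_outputs))
        for i in range(num_inputs)
    ]
    return grad_weights, grad_inputs


# Hypothetical layer with 2 inputs and 3 outputs.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, -1.0]
y = forward(W, x)            # forward pass; x must be kept for the backward pass
grad_y = [1.0, 0.0, -1.0]    # assumed gradients arriving from the next layer
grad_W, grad_x = backward(W, x, grad_y)
print(y, grad_W, grad_x)
```

Step 2 has the same multiply-and-accumulate structure as `forward`, only with the roles of the weight indices swapped, and Step 1 is the reason the forward inputs have to be stored during training.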

A variety of techniques are used to improve the efficiency and robustness of training. For
example, the loss from multiple inputs is often computed before a single pass of weight updates
is performed. This is called batching, which helps to speed up and stabilize the process.⁶


5To backpropagate through each layer: (1) compute the gradient of the loss relative to the weights, $\frac{\partial L}{\partial w_{ij}}$, from the layer
inputs (i.e., the forward activations, $x_i$) and the gradients of the loss relative to the layer outputs, $\frac{\partial L}{\partial y_j}$; and (2) compute the
gradient of the loss relative to the layer inputs, $\frac{\partial L}{\partial x_i}$, from the layer weights, $w_{ij}$, and the gradients of the loss relative to the
layer outputs, $\frac{\partial L}{\partial y_j}$.


6There are various forms of gradient descent, which differ in terms of how frequently the weights are updated. Batch Gradient
Descent updates the weights after computing the loss on the entire training set, which is computationally expensive and
requires significant storage. Stochastic Gradient Descent updates the weights after computing the loss on a single training example, and
the examples are shuffled after going through the entire training set. While it is fast, looking at a single example can be noisy
and cause the weights to go in the wrong direction. Finally, Mini-batch Gradient Descent divides the training set into smaller
sets called mini-batches, and updates the weights based on the loss of each mini-batch (commonly referred to simply as “batch”);
this approach is most commonly used. In general, each pass through the entire training set is referred to as an epoch.
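The three variants differ only in how many examples are consumed per weight update, which a short sketch can make concrete. The helper names, the training-set size, and the batch size of 32 are hypothetical; the loop counts updates rather than performing real training.

```python
import random


def iterate_minibatches(examples, batch_size, rng):
    """Shuffle the examples, then yield them in consecutive mini-batches."""
    order = list(examples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def count_weight_updates(num_examples, batch_size, num_epochs, seed=0):
    """Count weight updates when the weights are updated once per mini-batch."""
    rng = random.Random(seed)
    examples = list(range(num_examples))  # stand-ins for training examples
    updates = 0
    for _ in range(num_epochs):           # one epoch = one pass over the set
        for batch in iterate_minibatches(examples, batch_size, rng):
            # A real trainer would compute the loss gradient over `batch`
            # here and then apply a single weight update.
            updates += 1
    return updates


n = 1000  # hypothetical training-set size
batch_updates = count_weight_updates(n, batch_size=n, num_epochs=1)
sgd_updates = count_weight_updates(n, batch_size=1, num_epochs=1)
minibatch_updates = count_weight_updates(n, batch_size=32, num_epochs=1)
print(batch_updates, sgd_updates, minibatch_updates)
```

Setting the batch size to the full training set gives batch gradient descent, setting it to one gives stochastic gradient descent, and anything in between is the mini-batch case.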
There are multiple ways to train the weights. The most common approach, as described
above, is called supervised learning, where all the training samples are labeled (e.g., with the 
correct class). Unsupervised learning is another approach, where no training samples are labeled. 
Essentially, the goal is to find the structure or clusters in the data. Semi-supervised learning falls 
between the two approaches, where only a small subset of the training data is labeled (e.g., use 
unlabeled data to define the cluster boundaries, and use the small amount of labeled data to label 
the clusters). Finally, reinforcement learning can be used to train the weights such that given
the state of the current environment, the DNN can output what action the agent should take 
next to maximize expected rewards; however, the rewards might not be available immediately 
after an action, but instead only after a series of actions (often referred to as an episode). 

Another commonly used approach to determine weights is fine-tuning, where previously 
trained weights are available and are used as a starting point and then those weights are adjusted 
for a new dataset (e.g., transfer learning) or for a new constraint (e.g., reduced precision). This 
results in faster training than starting from a random starting point, and can sometimes result 
in better accuracy. 

This book will focus on the efficient processing of DNN inference rather than training, 
since DNN inference is often performed on embedded devices (rather than the cloud) where 
resources are limited, as discussed in more detail later.


1.3 DEVELOPMENT HISTORY


Although neural networks were proposed in the 1940s, the first practical application employing 
multiple digital neurons didn't appear until the late 1980s, with the LeNet network for handwritten
digit recognition [20].⁷ Such systems are widely used by ATMs for digit recognition on
checks. The early 2010s have seen a blossoming of DNN-based applications, with highlights 
such as Microsoft’s speech recognition system in 2011 [6] and the AlexNet DNN for image 
recognition in 2012 [7]. A brief chronology of deep learning is shown in Figure 1.7. 

The deep learning successes of the early 2010s are believed to be due to a confluence 
of three factors. The first factor is the amount of available information to train the networks. 
To learn a powerful representation (rather than using a hand-crafted approach) requires a large 
amount of training data. For example, Facebook receives up to a billion images per day, Walmart
creates 2.5 petabytes of customer data hourly, and YouTube has over 300 hours of video uploaded
every minute. As a result, these and many other businesses have a huge amount of data to train 
their algorithms. 

The second factor is the amount of compute capacity available. Semiconductor device and 
computer architecture advances have continued to provide increased computing capability, and 
we appear to have crossed a threshold where the large amount of weighted sum computation 
in DNNs, which is required for both inference and training, can be performed in a reasonable 
amount of time. 


7In the early 1960s, single neuron systems built out of analog logic were used for adaptive filtering [21, 22].


DNN Timeline

1940s: Neural networks were proposed
1960s: Deep neural networks were proposed
1989: Neural networks for recognizing hand-written digits (LeNet)
1990s: Hardware for shallow neural nets (Intel ETANN)
2011: Breakthrough DNN-based speech recognition (Microsoft)
2012: DNNs for vision start supplanting hand-crafted approaches (AlexNet)
2014+: Rise of DNN accelerator research (Neuflow, DianNao...)

Figure 1.7: A concise history of neural networks. “Deep” refers to the number of layers in the
network.


The successes of these early DNN applications opened the floodgates of algorithmic development.
They have also inspired the development of several (largely open source) frameworks
that make it even easier for researchers and practitioners to explore and use DNNs. Together,
these efforts contribute to the third factor, which is the evolution of the algorithmic techniques
that have improved accuracy significantly and broadened the domains to which DNNs are being
applied.

An excellent example of the successes in deep learning can be illustrated with the Ima- 
geNet Challenge [23]. This challenge is a contest involving several different components. One 
of the components is an image classification task, where algorithms are given an image and they 
must identify what is in the image, as shown in Figure 1.5. The training set consists of 1.2 mil- 
lion images, each of which is labeled with one of a thousand object categories that the image 
contains. For the evaluation phase, the algorithm must accurately identify objects in a test set of 
images, which it hasn't previously seen. 

Figure 1.8 shows the performance of the best entrants in the ImageNet contest over a
number of years. Initially, the best algorithms had an error rate of 25% or more. In
2012, a group from the University of Toronto used graphics processing units (GPUs) for their 
high compute capability and a DNN approach, named AlexNet, and reduced the error rate by 
approximately 10 percentage points [7]. Their accomplishment inspired an outpouring of deep 
learning algorithms that have resulted in a steady stream of improvements. 

In conjunction with the trend toward using deep learning approaches for the ImageNet 
Challenge, there has been a corresponding increase in the number of entrants using GPUs: from 
2012 when only 4 entrants used GPUs to 2014 when almost all the entrants (110) were using 
them. This use of GPUs reflects the almost complete switch from traditional computer vision 
approaches to deep learning-based approaches for the competition. 
Figure 1.8: Results from the ImageNet Challenge [23].


In 2015, the ImageNet winning entry, ResNet [24], exceeded human-level accuracy with 
a Top-5 error rate⁸ below 5%. Since then, the error rate has dropped below 3% and more focus
is now being placed on more challenging components of the competition, such as object de- 
tection and localization. These successes are clearly a contributing factor to the wide range of 
applications to which DNNs are being applied. 


1.4 APPLICATIONS OF DNNs 


Many domains can benefit from DNNs, ranging from entertainment to medicine. In this sec- 
tion, we will provide examples of areas where DNNs are currently making an impact and high- 
light emerging areas where DNNs may make an impact in the future. 


* Image and Video: Video is arguably the biggest of big data. It accounts for over 70% of 
today’s Internet traffic [25]. For instance, over 800 million hours of video is collected daily 
worldwide for video surveillance [26]. Computer vision is necessary to extract meaningful 
information from video. DNNs have significantly improved the accuracy of many com- 
puter vision tasks such as image classification [23], object localization and detection [27], 
image segmentation [28], and action recognition [29]. 


* Speech and Language: DNNs have significantly improved the accuracy of speech recognition
[30] as well as many related tasks such as machine translation [6], natural language
processing [31], and audio generation [32]. 


* Medicine and Health Care: DNNs have played an important role in genomics to gain insight
into the genetics of diseases such as autism, cancers, and spinal muscular atrophy [33–36].
They have also been used in medical imaging such as detecting skin cancer [9], brain
cancer [37], and breast cancer [38].


8The Top-5 error rate is measured based on whether the correct answer appears in one of the top five categories selected
by the algorithm.


* Game Play: Recently, many of the grand AI challenges involving game play have been
overcome using DNNs. These successes also required innovations in training techniques,
and many rely on reinforcement learning [39]. DNNs have surpassed human level accuracy 
in playing games such as Atari [40], Go [10], and StarCraft [41], where an exhaustive 
search of all possibilities is not feasible due to the immense number of possible moves. 


* Robotics: DNNs have been successful in the domain of robotic tasks such as grasping
with a robotic arm [42], motion planning for ground robots [43], visual navigation [8, 44], 
control to stabilize a quadcopter [45], and driving strategies for autonomous vehicles [46]. 


DNNs are already widely used in multimedia applications today (e.g., computer vision, 
speech recognition). Looking forward, we expect that DNNs will likely play an increasingly 
important role in the medical and robotics fields, as discussed above, as well as finance (e.g., 
for trading, energy forecasting, and risk assessment), infrastructure (e.g., structural safety, and 
traffic control), weather forecasting, and event detection [47]. The myriad application domains 
pose new challenges to the efficient processing of DNNs; the solutions then have to be adaptive 
and scalable in order to handle the new and varied forms of DNNs that these applications may 
employ. 
The various applications and aspects of DNN processing (i.e., training versus inference) have
different computational needs. Specifically, training often requires a large dataset⁹ and significant
computational resources for multiple weight-update iterations. In many cases, training a
DNN model still takes several hours to multiple days (or weeks or months!) and thus is typically
performed in the cloud.

Inference, on the other hand, can happen either in the cloud or at the edge (e.g., Internet 
of Things (IoT) or mobile). In many applications, it is desirable to have the DNN inference pro- 
cessing at the edge near the sensor. For instance, in computer vision applications, such as mea- 
suring wait times in stores or predicting traffic patterns, it would be desirable to extract mean- 
ingful information from the video right at the image sensor rather than in the cloud, to reduce 
the communication cost. For other applications, such as autonomous vehicles, drone navigation, 
and robotics, local processing is desired since the latency and security risks of relying on the cloud 
are too high. However, video involves a large amount of data, which is computationally complex 
to process; thus, low-cost hardware to analyze video is challenging, yet critical, to enabling these 
applications.¹⁰ Speech recognition allows us to seamlessly interact with electronic devices, such
as smartphones. While currently most of the processing for applications such as Apple Siri and
Amazon Alexa voice services is in the cloud, it is still desirable to perform the recognition on
the device itself to reduce latency. Some works have even considered partitioning the processing
between the cloud and edge on a per-layer basis in order to improve performance [49]. However,
considerations related to dependency on connectivity, privacy, and security augur for keeping
computation at the edge. Many of the embedded platforms that perform DNN inference have
stringent limitations on energy consumption, compute, and memory cost; efficient
processing of DNNs has become of prime importance under these constraints.

9One of the major drawbacks of DNNs is their need for large datasets to prevent overfitting during training.

10As a reference, running a DNN on an embedded device is estimated to consume several orders of magnitude higher
energy per pixel than video compression, which is a common form of processing near the image sensor [48].
CHAPTER 2


Overview of Deep Neural 
Networks 


Deep Neural Networks (DNNs) come in a wide variety of shapes and sizes depending on the 
application.¹ The popular shapes and sizes are also evolving rapidly to improve accuracy and
efficiency. In all cases, the input to a DNN is a set of values representing the information to be 
analyzed by the network. For instance, these values can be pixels of an image, sampled amplitudes 
of an audio wave, or the numerical representation of the state of some system or game. 

In this chapter, we will describe the key building blocks for DNNs. As there are many 
different types of DNNs [50], we will focus our attention on those that are most widely used. We 
will begin by describing the salient characteristics of commonly used DNN layers in Sections 2.1 
and 2.2. We will then describe popular DNN layers and how these layers can be combined to 
form various types of DNNs in Section 2.3. Section 2.4 will provide a detailed discussion on 
convolutional neural networks (CNNs), since they are widely used and tend to provide many 
opportunities for efficient DNN processing. It will also highlight various popular CNN models 
that are often used as workloads for evaluating DNN hardware accelerators. Next, in Section 2.5, 
we will briefly discuss other types of DNNs and describe how they are similar to and differ 
from CNNs from a workload processing perspective (e.g., data dependencies, types of compute 
operations, etc.). Finally, in Section 2.6, we will discuss the various DNN development resources 
(e.g., frameworks and datasets), which researchers and practitioners have made available to help 
enable the rapid progress in DNN model and hardware research and development. 


2.1 ATTRIBUTES OF CONNECTIONS WITHIN A LAYER 


As discussed in Chapter 1, DNNs are composed of several processing layers, where in most 
layers the main computation is a weighted sum. There are several different types of layers, which 
primarily differ in terms of how the inputs and outputs are connected within the layers. 

1The DNN research community often refers to the shape and size of a DNN as its “network architecture.” However, to
avoid confusion with the use of the word “architecture” by the hardware community, we will talk about “DNN models” and
their shape and size in this book.

Figure 2.1: Properties of connections in DNNs (Figure adapted from [4]).

There are two main attributes of the connections within a layer:


1. The connection pattern between the input and output activations, as shown in Figure 2.1a:
if a layer has the attribute that every input activation is connected to every output, then we
call that layer fully connected. On the other hand, if a layer has the attribute that only a subset
of inputs are connected to the output, then we call that layer sparsely connected. Note that
the weights associated with these connections can be zero or non-zero; if a weight happens
to be zero (e.g., as a result of training), it does not mean there is no connection (i.e., the
connection still exists).


For sparsely connected layers, a sub attribute is related to the structure of the connections. 
Input activations may connect to any output activation (i.e., global), or they may only 
connect to output activations in their neighborhood (i.e., local). The consequence of such 
local connections is that each output activation is a function of a restricted window of input 
activations, which is referred to as the receptive field. 


2. The value of the weight associated with each connection: the most general case is that 
the weight can take on any value (e.g., each weight can have a unique value). A more 
restricted case is that the same value is shared by multiple weights, which is referred to as 
weight sharing. 
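To make the effect of these two attributes concrete, the following minimal sketch counts the number of unique weights needed to map a single-channel H x W input to an output under three combinations of the attributes. The function name `weight_counts`, the single-channel restriction, the unit stride, and the example sizes are our own assumptions for illustration and do not come from any particular DNN model.

```python
def weight_counts(H, W, R, S):
    """Count unique weights for a single-channel H x W input.

    Assumes unit stride, so the output has (H - R + 1) x (W - S + 1) values.
    Returns a tuple (fully_connected, local_no_sharing, local_with_sharing).
    """
    P, Q = H - R + 1, W - S + 1          # output height and width
    num_inputs, num_outputs = H * W, P * Q

    # Fully connected: every input connects to every output, each weight unique.
    fully_connected = num_inputs * num_outputs

    # Sparse, local connections: each output sees only an R x S window,
    # but every connection still has its own weight.
    local_no_sharing = num_outputs * R * S

    # Local connections with weight sharing: the same R x S weights are reused
    # for every output position.
    local_with_sharing = R * S

    return fully_connected, local_no_sharing, local_with_sharing


if __name__ == "__main__":
    print(weight_counts(H=32, W=32, R=5, S=5))
```

For the hypothetical 32 x 32 input and 5 x 5 window used here, restricting connections to a local window and then sharing weights reduces the count from several hundred thousand to 25.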


Combinations of these attributes result in many of the common layer types. Any layer with
the fully connected attribute is called a fully connected layer (FC layer). In order to distinguish
the attribute from the type of layer, in this chapter, we will use the term FC layer as distinguished
from the fully connected attribute. However, in subsequent chapters we will follow the common
practice of using the terms interchangeably. Another widely used layer type is the convolutional
(CONV) layer, which is locally, sparsely connected with weight sharing.² The computation in
FC and CONV layers is a weighted sum. However, there are other computations that might be
performed and these result in other types of layers. We will discuss FC, CONV, and these other
layers in more detail in Section 2.3.

2CONV layers use a specific type of weight sharing, which will be described in Section 2.4.


2.2 ATTRIBUTES OF CONNECTIONS BETWEEN LAYERS 


Another attribute is the connections from the output of one layer to the input of another layer, 
as shown in Figure 2.1b. The output can be connected to the input of the next layer in which 
case the connection is referred to as feed forward. With feed-forward connections, all of the 
computation is performed as a sequence of operations on the outputs of a previous layer.³ The
network has no memory, and the output for an input is always the same irrespective of the sequence
of inputs previously given to the network. DNNs that contain feed-forward connections are
referred to as feed-forward networks. Examples of these types of networks include multi-layer 
perceptrons (MLPs), which are DNNs that are composed entirely of feed-forward FC layers and 
convolutional neural networks (CNNs), which are DNNs that contain both FC and CONV 
layers. CNNs, which are commonly used for image processing and computer vision, will be 
discussed in more detail in Section 2.4. 

Alternatively, the output can be fed back to the input of its own layer in which case the 
connection is often referred to as recurrent. With recurrent connections, the output of a layer is 
a function of both the current and prior input(s) to the layer. This creates a form of memory in 
the DNN, which allows long-term dependencies to affect the output. DNNs that contain these 
connections are referred to as recurrent neural networks (RNNs), which are commonly used to 
process sequential data (e.g., speech, text), and will be discussed in more detail in Section 2.5. 
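The difference between the two connection types can be illustrated with a small sketch. It assumes a single layer with a tanh nonlinearity and, for the recurrent case, the simple update h = tanh(Wx + Vh); the weight matrices `W` and `V`, the layer sizes, and the input values are all hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # input-to-output weights (hypothetical sizes)
V = rng.standard_normal((3, 3))   # output-to-input feedback weights


def feed_forward(x):
    """Output depends only on the current input x."""
    return np.tanh(W @ x)


def recurrent(xs):
    """Output after the last input depends on the whole input sequence xs."""
    h = np.zeros(3)                    # state fed back into the layer
    for x in xs:
        h = np.tanh(W @ x + V @ h)     # current input combined with prior output
    return h


x = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])
b = np.array([-2.0, 1.0])

# Feed-forward: the same input gives the same output regardless of history.
print(feed_forward(x))
# Recurrent: the same final input x gives different outputs after a versus b.
print(recurrent([a, x]), recurrent([b, x]))
```

The feedback term `V @ h` is what carries information from earlier inputs forward, which is the form of memory described above.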


2.3 POPULAR TYPES OF LAYERS IN DNNs 


In this section, we will discuss the various popular layers used to form DNNs. We will begin 
by describing the CONV and FC layers whose main computation is a weighted sum, since that 
tends to dominate the computation cost in terms of both energy consumption and throughput. 
We will then discuss various layers that can optionally be included in a DNN and do not use 
weighted sums such as nonlinearity, pooling, and normalization. 

These layers can be viewed as primitive layers, which can be combined to form compound
layers. Compound layers are often given names as a convenience, when the same combination
of primitive layers is frequently used together. In practice, people often refer to either primitive
or compound layers as just layers.


2.3.1 CONV LAYER (CONVOLUTIONAL) 


CONV layers are primarily composed of high-dimensional convolutions, as shown in Figure 2.2.
In this computation, the input activations of a layer are structured as a 3-D input feature map
(ifmap), where the dimensions are the height (H), width (W), and number of input channels
(C). The weights of a layer are structured as a 3-D filter, where the dimensions are the height
(R), width (S), and number of input channels (C). Notice that the number of channels for the
input feature map and the filter are the same. For each input channel, the input feature map
undergoes a 2-D convolution (see Figure 2.2a) with the corresponding channel in the filter. The
results of the convolution at each point are summed across all the input channels to generate
the output partial sums. In addition, a 1-D (scalar) bias can be added to the filtering results,
but some recent networks [24] remove its usage from parts of the layers. The results of this
computation are the output partial sums that comprise one channel of the output feature map
(ofmap).⁴ Additional 3-D filters can be used on the same input feature map to create additional
output channels (i.e., applying M filters to the input feature map generates M output channels
in the output feature map). Finally, multiple input feature maps (N) may be processed together
as a batch to potentially improve reuse of the filter weights.

3Connections can come from the immediately preceding layer or an earlier layer. Furthermore, connections from a layer
can go to multiple later layers.

Figure 2.2: Dimensionality of convolutions. (a) Shows the traditional 2-D convolution used in
image processing. (b) Shows the high dimensional convolution used in CNNs, which applies a
2-D convolution on each channel.

Table 2.1: Shape parameters of a CONV/FC layer

Shape Parameter | Description
N | Batch size of 3-D fmaps
M | Number of 3-D filters / number of channels of ofmap (output channels)
C | Number of channels of filter / ifmap (input channels)
H/W | Ifmap spatial height/width
R/S | Filter spatial height/width (= H/W in FC)
P/Q | Ofmap spatial height/width (= 1 in FC)

Given the shape parameters in Table 2.1,⁵ the computation of a CONV layer is defined as:

$$
\begin{aligned}
o[n][m][p][q] &= \left( \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} i[n][c][Up+r][Uq+s] \times f[m][c][r][s] \right) + b[m], \\
& 0 \le n < N,\; 0 \le m < M,\; 0 \le p < P,\; 0 \le q < Q, \\
& P = (H - R + U)/U,\; Q = (W - S + U)/U.
\end{aligned}
\tag{2.1}
$$

o, i, f, and b are the tensors of the ofmaps, ifmaps, filters, and biases, respectively. U is a given
stride size.
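As an illustration, Equation (2.1) can be written directly as a nest of seven loops. The following is a minimal, unoptimized NumPy sketch; the function name `conv_layer` and the array layouts (N, C, H, W) for ifmaps and (M, C, R, S) for filters are our assumptions, and integer division is used for P and Q, which assumes the stride U divides H - R and W - S evenly.

```python
import numpy as np


def conv_layer(i, f, b, U=1):
    """Direct loop-nest form of the CONV layer in Equation (2.1).

    i: ifmaps,  shape (N, C, H, W)
    f: filters, shape (M, C, R, S)
    b: biases,  shape (M,)
    U: stride
    """
    N, C, H, W = i.shape
    M, C_f, R, S = f.shape
    assert C == C_f, "ifmap and filter must have the same number of channels"
    P = (H - R + U) // U
    Q = (W - S + U) // U

    o = np.zeros((N, M, P, Q))
    for n in range(N):                      # batch
        for m in range(M):                  # output channels (filters)
            for p in range(P):              # ofmap rows
                for q in range(Q):          # ofmap columns
                    acc = 0.0
                    for c in range(C):      # input channels
                        for r in range(R):  # filter rows
                            for s in range(S):  # filter columns
                                acc += i[n, c, U * p + r, U * q + s] * f[m, c, r, s]
                    o[n, m, p, q] = acc + b[m]
    return o


if __name__ == "__main__":
    out = conv_layer(np.ones((1, 2, 4, 4)), np.ones((3, 2, 3, 3)), np.arange(3.0))
    print(out.shape)  # (N, M, P, Q)
```

The order of the seven loops here is one arbitrary choice; any permutation of them computes the same result.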

Figure 2.2b shows a visualization of this computation (ignoring biases). As much as pos- 
sible, we will adhere to the following coloring scheme in this book. 


* Blue: input activations belonging to an input feature map.


* Green: weights belonging to a filter.


* Red: partial sums. Note: since there is no formal term for an array of partial sums, we will
sometimes label an array of partial sums as an output feature map and color it red (even
though, technically, output feature maps are composed of activations derived from partial
sums that have passed through a nonlinear function and therefore should be blue).


4For simplicity, in this chapter, we will refer to an array of partial sums as an output feature map. However, technically,
the output feature map would be composed of the values of the partial sums after they have gone through a nonlinear function
(i.e., the output activations).

5In some literature, K is used rather than M to denote the number of 3-D filters (also referred to as kernels), which
determines the number of output feature map channels. We opted not to use K to avoid confusion with yet other communities
that use it to refer to the number of dimensions. We also have adopted the convention of using P and Q as the dimensions of
the output to align with other publications and since our prior use of E and F caused an alias with the use of “F” to represent
filter weights. Note that some literature also uses X and Y to denote the spatial dimensions of the input rather than W and
H.


Returning to the CONV layer calculation in Equation (2.1), one notes that the operands
(i.e., the ofmaps, ifmaps, and filters) have many dimensions. Therefore, these operands can be
viewed as tensors (i.e., high-dimension arrays) and the computation can be treated as a tensor
algebra computation where the computation involves performing binary operations (e.g., multiplications
and additions forming dot products) between tensors to produce new tensors. Since
the CONV layer can be viewed as a tensor algebra operation, it is worth noting that an alternative
representation for a CONV layer can be created using the tensor index notation found in [51],
which describes a compiler for sparse tensor algebra computations.⁶ The tensor index notation
provides a compact way to describe a kernel’s functionality. For example, in this notation matrix
multiply Z = AB can be written as:


$$Z_{ij} = \sum_{k} A_{ik} B_{kj}. \tag{2.2}$$


That is, the output point (i, j) is formed by taking a dot product of k values along the i-th row
of A and the j-th column of B.⁷ Extending this notation to express computation on the index
variables (by putting those calculations in parentheses) allows a CONV layer in tensor index
notation to be represented quite concisely as:


$$O_{nmpq} = \left( \sum_{crs} I_{nc(Up+r)(Uq+s)} F_{mcrs} \right) + b_{m}. \tag{2.3}$$


In this calculation, each output at a point (n,m, p,q) is calculated as a dot product taken across 
the index variables c, r, and s of the specified elements of the input activation and filter weight 
tensors. Note that this notation attaches no significance to the order of the index variables in the 
summation. The relevance of this will become apparent in the discussion of dataflows (Chapter 5) 
and mapping computations onto a DNN accelerator (Chapter 6). 
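The tensor index notation maps closely onto NumPy's `einsum`, which likewise names index variables without fixing a loop order. The sketch below is one possible rendering under our own assumptions: the helper names `matmul_index` and `conv_index` are hypothetical, and because `einsum` cannot express index arithmetic such as Up + r directly, we first build a windowed view of the input with `sliding_window_view` so that the window offsets r and s become ordinary indices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def matmul_index(A, B):
    """Matrix multiply Z = AB in index form: Z_ij = sum_k A_ik B_kj."""
    return np.einsum("ik,kj->ij", A, B)


def conv_index(I, F, b, U=1):
    """CONV layer in index form: O_nmpq = sum_crs I_nc(Up+r)(Uq+s) F_mcrs + b_m.

    I: (N, C, H, W), F: (M, C, R, S), b: (M,), U: stride.
    """
    R, S = F.shape[2:]
    # win[n, c, p', q', r, s] == I[n, c, p' + r, q' + s]; keeping every U-th
    # window position turns p' into U*p and q' into U*q.
    win = sliding_window_view(I, (R, S), axis=(2, 3))[:, :, ::U, ::U]
    return np.einsum("ncpqrs,mcrs->nmpq", win, F) + b[None, :, None, None]


if __name__ == "__main__":
    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(12.0).reshape(3, 4)
    print(np.allclose(matmul_index(A, B), A @ B))
    print(conv_index(np.ones((1, 2, 4, 4)), np.ones((3, 2, 3, 3)), np.arange(3.0)).shape)
```

Note that the subscript string states only which indices are summed (c, r, s) and which survive (n, m, p, q); the order in which the underlying loops run is left to the implementation.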

Finally, to align the terminology of CNNs with the generic DNN, 


* filters are composed of weights (i.e., synapses), and 


* input and output feature maps (ifmaps, ofmaps) are composed of input and output ac- 
tivations (partial sums after application of a nonlinear function) (i.e., input and output 
neurons). 


6Note that many of the values in the CONV layer tensors are zero, making the tensors sparse. The origins of this sparsity,
and approaches for performing the resulting sparse tensor algebra, are presented in Chapter 8. 

7Note that Albert Einstein popularized a similar notation for tensor algebra which omits any explicit specification of the 
summation variable. 
Figure 2.3: Fully connected layer from convolution point of view with H = R, W = S, P =
Q = 1, and U = 1.


2.3.2 FC LAYER (FULLY CONNECTED) 


In an FC layer, every value in the output feature map is a weighted sum of every input value in 
the input feature map (i.e., it is fully connected). Furthermore, FC layers typically do not exhibit 
weight sharing and as a result the computation tends to be memory-bound. FC layers are often 
processed in the form of a matrix multiplication, which will be explained in Chapter 4. This is
the reason why matrix multiplication is often associated with DNN processing.

An FC layer can also be viewed as a special case of a CONV layer, specifically a CONV
layer where the filters are of the same size as the input feature maps. As a result, it does not have
the local, sparsely connected with weight sharing property of CONV layers. Equation (2.1)
therefore still holds for the computation of FC layers with a few additional constraints on the
shape parameters: H = R, W = S, P = Q = 1, and U = 1. Figure 2.3 shows a visualization
of this computation and in the tensor index notation from Section 2.3.1 it is:


$$O_{nm} = \sum_{chw} I_{nchw} F_{mchw}. \tag{2.4}$$
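Equation (2.4) can be checked numerically against the matrix multiplication view of an FC layer. The sketch below is only illustrative: the function names `fc_index` and `fc_matmul` and the tensor sizes are our assumptions, and the flattening order used for the matrix form is simply NumPy's default row-major order.

```python
import numpy as np


def fc_index(I, F):
    """FC layer in index form, Equation (2.4): O_nm = sum_chw I_nchw F_mchw."""
    return np.einsum("nchw,mchw->nm", I, F)


def fc_matmul(I, F):
    """Same computation as a matrix multiply of flattened ifmaps and filters."""
    N, M = I.shape[0], F.shape[0]
    ifmap_matrix = I.reshape(N, -1)     # N x (C*H*W)
    filter_matrix = F.reshape(M, -1)    # M x (C*H*W)
    return ifmap_matrix @ filter_matrix.T   # N x M


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    I = rng.standard_normal((4, 3, 5, 5))   # N=4, C=3, H=W=5 (hypothetical sizes)
    F = rng.standard_normal((10, 3, 5, 5))  # M=10 filters, each as large as an ifmap
    print(np.allclose(fc_index(I, F), fc_matmul(I, F)))
```

Because each filter is as large as the ifmap (H = R, W = S), every weight is used exactly once per ifmap, which is consistent with the lack of weight reuse noted above.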


2.3.3 NONLINEARITY 


A nonlinear activation function is typically applied after each CONV or FC layer. Various nonlinear
functions are used to introduce nonlinearity into the DNN, as shown in Figure 2.4. These
include historically conventional nonlinear functions such as sigmoid or hyperbolic tangent.
These were popular because they facilitate mathematical analysis/proofs. The rectified linear unit
(ReLU) [52] has become popular in recent years due to its simplicity and its ability to enable fast
training, while achieving comparable accuracy. Variations of ReLU, such as leaky ReLU [53],
parametric ReLU [54], exponential LU [55], and Swish [56] have also been explored for improved
accuracy. Finally, a nonlinearity called maxout, which takes the maximum value of two
intersecting linear functions, has been shown to be effective in speech recognition tasks [57, 58].

Figure 2.4: Various forms of nonlinear activation functions. (Figure adapted from [62].)
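For reference, the nonlinear functions named in this section can be sketched in a few lines of NumPy using their commonly cited definitions. The default values of `alpha` and `beta` below are illustrative assumptions rather than recommended settings (in parametric ReLU, for example, `alpha` is learned during training), and the scalar-weight form of `maxout` is a simplification chosen for brevity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    return np.tanh(x)


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha for negative inputs (alpha is learned in parametric ReLU).
    return np.where(x > 0, x, alpha * x)


def exponential_lu(x, alpha=1.0):
    # Clamp the exponent argument so large positive x cannot overflow.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def maxout(x, w1, b1, w2, b2):
    # Maximum of two linear functions of the input.
    return np.maximum(w1 * x + b1, w2 * x + b2)


if __name__ == "__main__":
    x = np.array([-2.0, 0.0, 3.0])
    print(relu(x), leaky_relu(x, alpha=0.1))
```

ReLU and its leaky variant need only a comparison and at most one multiplication per activation, whereas sigmoid, tanh, exponential LU, and Swish require an exponential.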


2.3.4 POOLING AND UNPOOLING 


There are a variety of computations that can be used to change the spatial resolution (i.e., H
and W or P and Q) of the feature map depending on the application. For applications such as
image classification, the goal is to summarize the entire image into one label; therefore, reducing
the spatial resolution may be desirable. Networks that reduce input into a sparse output are
often referred to as encoder networks. For applications such as semantic segmentation, the goal
is to assign a label to each pixel in the image;⁹ as a result, increasing the spatial resolution may
be desirable. Networks that expand input into a dense output are often referred to as decoder
networks.

8In addition to being simple to implement, ReLU also increases the sparsity of the output activations, which can be
exploited by a DNN accelerator to increase throughput, reduce energy consumption and reduce storage cost, as described in
Section 8.1.1.

9In the literature, this is often referred to as dense prediction.

Figure 2.6: Various forms of unpooling/upsampling. (Figures adapted from [64].)

Reducing the spatial resolution of a feature map is referred to as pooling or more generically
downsampling. Pooling, which is applied to each channel separately, enables the network to be
robust and invariant to small shifts and distortions. Pooling combines, or pools, a set of values
in its receptive field into a smaller number of values. Pooling can be parameterized based on the
size of its receptive field (e.g., 2×2) and pooling operation (e.g., max or average), as shown in
Figure 2.5. Typically, pooling occurs on non-overlapping blocks (i.e., the stride is equal to the
size of the pooling). Usually a stride of greater than one is used such that there is a reduction in
the spatial resolution of the representation (i.e., feature map). Pooling is usually performed after
the nonlinearity.
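Non-overlapping pooling as described above can be sketched with a reshape that groups each k x k block into its own pair of axes. This is a minimal example under our own assumptions: the function name `pool2d` and the (N, C, H, W) layout are hypothetical, and any rows or columns that do not fill a complete block are simply dropped.

```python
import numpy as np


def pool2d(x, k=2, op="max"):
    """Pool non-overlapping k x k blocks (stride equal to k) of each channel.

    x: feature map of shape (N, C, H, W).
    """
    N, C, H, W = x.shape
    P, Q = H // k, W // k
    # Split each spatial axis into (block index, position within block).
    blocks = x[:, :, :P * k, :Q * k].reshape(N, C, P, k, Q, k)
    if op == "max":
        return blocks.max(axis=(3, 5))
    if op == "avg":
        return blocks.mean(axis=(3, 5))
    raise ValueError(f"unsupported pooling operation: {op}")


if __name__ == "__main__":
    x = np.arange(16.0).reshape(1, 1, 4, 4)
    print(pool2d(x, 2, "max")[0, 0])
    print(pool2d(x, 2, "avg")[0, 0])
```

Each channel is pooled independently, so the number of channels is unchanged while the height and width shrink by a factor of k.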

Increasing the spatial resolution of a feature map is referred to as unpooling or more generically
as upsampling. Commonly used forms of upsampling include inserting zeros between the
activations, as shown in Figure 2.6a (this type of upsampling is commonly referred to as unpooling¹⁰),
interpolation using nearest neighbors [63, 64], as shown in Figure 2.6b, and interpolation
with bilinear or bicubic filtering [65]. Upsampling is usually performed before the CONV or FC
layer. Upsampling can introduce structured sparsity in the input feature map that can be exploited
for improved energy efficiency and throughput, as described in Section 8.1.1.
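Two of the upsampling forms mentioned above, zero insertion and nearest-neighbor interpolation, can be sketched as follows. The function names are hypothetical, and placing each original activation in the top-left corner of its k x k block during zero insertion is our assumption about the regular pattern; other placements are possible.

```python
import numpy as np


def unpool_zero_insert(x, k=2):
    """Upsample by k, inserting zeros between activations in a regular pattern.

    x: feature map of shape (N, C, H, W).
    """
    N, C, H, W = x.shape
    y = np.zeros((N, C, H * k, W * k), dtype=x.dtype)
    y[:, :, ::k, ::k] = x   # original values land on every k-th row and column
    return y


def upsample_nearest(x, k=2):
    """Upsample by k, copying each activation into a k x k block."""
    return x.repeat(k, axis=2).repeat(k, axis=3)


if __name__ == "__main__":
    x = np.array([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)
    print(unpool_zero_insert(x)[0, 0])
    print(upsample_nearest(x)[0, 0])
```

With zero insertion, only one in every k x k values of the result is non-zero, and the positions of the zeros are known in advance; this is the structured sparsity referred to above.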


2.3.5 NORMALIZATION


Controlling the input distribution across layers can help to significantly speed up training and
improve accuracy. Accordingly, the distribution of the layer input activations (σ, μ) is normalized
such that it has a zero mean and a unit standard deviation. In batch normalization (BN), the
normalized value is further scaled and shifted, as shown in Equation (2.5), where the parameters
(γ, β) are learned from training [66]:¹¹,¹²

$$y = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} \gamma + \beta, \tag{2.5}$$

where ε is a small constant to avoid numerical problems.

10There are two versions of unpooling: (1) zero insertion is applied in a regular pattern, as shown in Figure 2.6a [60], which
is most commonly used; and (2) unpooling is paired with a max pooling layer, where the location of the max value during
pooling is stored, and during unpooling the non-zero value is placed at the location of the max value before
pooling [61].

Prior to the wide adoption of BN, local response normalization (LRN) [7] was used,
which was inspired by lateral inhibition in neurobiology where excited neurons (i.e., high value
activations) should subdue their neighbors (i.e., cause low value activations); however, BN is now
considered standard practice in the design of CNNs while LRN is mostly deprecated. Note that
while LRN is usually performed after the nonlinear function, BN is usually performed between
the CONV or FC layer and the nonlinear function. If BN is performed immediately after the
CONV or FC layer, its computation can be folded into the weights of the CONV or FC layer
resulting in no additional computation for inference.
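The folding of BN into the preceding layer can be verified with a short sketch. It assumes inference-time BN with fixed per-output-channel parameters and uses an FC layer (in the form of Equation (2.4) plus a bias) as the preceding layer for brevity; the function names `fc`, `batch_norm`, and `fold_bn`, the value of `eps`, and the tensor sizes are all hypothetical.

```python
import numpy as np


def fc(I, F, b):
    """FC layer with bias: O_nm = sum_chw I_nchw F_mchw + b_m."""
    return np.einsum("nchw,mchw->nm", I, F) + b


def batch_norm(x, mu, var, gamma, beta, eps=1e-5):
    """Equation (2.5) applied per output channel m; x has shape (N, M)."""
    return (x - mu) / np.sqrt(var + eps) * gamma + beta


def fold_bn(F, b, mu, var, gamma, beta, eps=1e-5):
    """Fold fixed BN parameters into the filter weights and biases."""
    scale = gamma / np.sqrt(var + eps)            # one scale factor per filter
    F_folded = F * scale[:, None, None, None]     # scale every weight of filter m
    b_folded = (b - mu) * scale + beta            # shift absorbed into the bias
    return F_folded, b_folded


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    I = rng.standard_normal((4, 3, 5, 5))
    F = rng.standard_normal((6, 3, 5, 5))
    b, mu, gamma, beta = (rng.standard_normal(6) for _ in range(4))
    var = rng.random(6) + 0.5                     # variances must be positive

    reference = batch_norm(fc(I, F, b), mu, var, gamma, beta)
    F_folded, b_folded = fold_bn(F, b, mu, var, gamma, beta)
    print(np.allclose(reference, fc(I, F_folded, b_folded)))
```

The folding works because both the layer and BN are affine in the layer's weights and bias; it would not apply if a nonlinear function sat between them.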


2.3.6 COMPOUND LAYERS 


The above primitive layers can be combined to form compound layers. For instance, attention
layers are composed of matrix multiplications and feed-forward, fully connected layers [68].
Attention layers have become popular for processing a wide range of data including language
and images and are commonly used in a type of DNN called Transformers. We will discuss
Transformers in more detail in Section 2.5. Another example of a compound layer is the up-convolution
layer [60], which performs zero-insertion (unpooling) on the input and then applies
a convolutional layer.¹³ Up-convolution layers are typically used in DNNs such as Generative
Adversarial Networks (GANs) and Auto Encoders (AEs) that process image data. We will discuss
GANs and AEs in more detail in Section 2.5.


2.4 CONVOLUTIONAL NEURAL NETWORKS (CNNs) 


CNNs are a common form of DNNs that are composed of multiple CONV layers, as shown
in Figure 2.7. In such networks, each layer generates a successively higher-level abstraction of
the input data, called a feature map (fmap), which preserves essential yet unique information.
Modern CNNs are able to achieve superior performance by employing a very deep hierarchy of
layers. CNNs are widely used in a variety of applications including image understanding [7],
speech recognition [70], game play [10], robotics [42], etc. This book will focus on their use in
image processing, specifically for the task of image classification [7]. Modern CNN models
for image classification typically have 5 [7] to more than 1,000 [24] CONV layers. A small
number, e.g., 1 to 3, of FC layers are typically applied after the CONV layers for classification
purposes.

11It has been recently reported that the reason batch normalization enables faster and more stable training is
that it makes the optimization landscape smoother, resulting in more predictive and stable behavior of the gradient [67]; this
is in contrast to the popular belief that batch normalization stabilizes the distribution of the input across layers. Nonetheless,
batch normalization continues to be widely used for training and thus needs to be supported during inference.

12During training, parameters σ and μ are computed per batch, and γ and β are updated per batch based on the gradient;
therefore, training for different batch sizes will result in different σ and μ parameters, which can impact accuracy. Note that
each channel has its own set of σ, μ, γ, and β parameters. During inference, all parameters are fixed, where σ and μ are
computed from the entire training set. To avoid performing an extra pass over the entire training set to compute σ and μ, σ
and μ are usually implemented as the running average of the per batch σ and μ computed during training.

13Note variants of the up CONV layer with different types of upsampling include deconvolution layer, sub-pixel or fractional
convolutional layer, transposed convolutional layer, and backward convolution layer [69].

Figure 2.7: Convolutional Neural Networks.


2.4.1 POPULAR CNN MODELS 


Many CNN models have been developed over the past two decades. Each of these models
is different in terms of the number of layers, layer types, layer shapes (i.e., filter size, number of
channels and filters), and connections between layers. Understanding these variations and trends
is important for incorporating the right flexibility in any efficient DNN accelerator, as discussed 
in Chapter 3. 

In this section, we will give an overview of various popular CNNs such as LeNet [71] as 
well as those that competed in and/or won the ImageNet Challenge [23], as shown in Figure 1.8, 
most of whose models with pre-trained weights are publicly available for download; the CNN 
models are summarized in Table 2.2. Two results for Top-5 error are reported. In the first row, 
the accuracy is boosted by using multiple crops from the image and an ensemble of multiple 
trained models (i.e., the CNN needs to be run several times); these results were used to compete 
in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used
(i.e., the CNN is run only once), which is more consistent with what would likely be deployed
in real-time and/or energy-constrained applications.

Table 2.2: Summary of popular CNNs [7, 24, 71, 73, 74]. †Accuracy is measured based on Top-5
error on ImageNet [23] using multiple crops. This version of LeNet-5 has 431k weights for
the filters and requires 2.3M MACs per image, and uses ReLU rather than sigmoid.

LeNet [20] was one of the first CNN approaches introduced in 1989. It was designed for 
the task of digit classification in grayscale images of size 28x28. The most well known version, 
LeNet-5, contains two CONV layers followed by two FC layers [71]. Each CONV layer uses 
filters of size 5x5 (1 channel per filter) with 6 filters in the first layer and 16 filters in the second 
layer. Average pooling of 2x2 is used after each convolution and a sigmoid is used for the
nonlinearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per
image. LeNet led to CNNs' first commercial success, as it was deployed in ATMs to recognize
digits for check deposits. 

AlexNet [7] was the first CNN to win the ImageNet Challenge in 2012. It consists of five
CONV layers followed by three FC layers. Within each CONV layer, there are 96 to 384 filters 
and the filter size ranges from 3x3 to 11x11, with 3 to 256 channels each. In the first layer, 
the three channels of the filter correspond to the red, green, and blue components of the input 
image. A ReLU nonlinearity is used in each layer. Max pooling of 3x3 is applied to the outputs 
of layers 1, 2, and 5. To reduce computation, a stride of 4 is used at the first layer of the network. 
AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is 
no longer popular in later CNN models. One important factor that differentiates AlexNet from 
LeNet is that the number of weights is much larger and the shapes vary from layer to layer. 
To reduce the number of weights and the computation in the second CONV layer, the 96 output
channels of the first layer are split into two groups of 48 input channels for the second layer,
such that the filters in the second layer only have 48 channels. This approach is referred to as
“grouped convolution” and illustrated in Figure 2.8.¹⁴ Similarly, the weights in the fourth and fifth
layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs
to process one 227x227 input image. 
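The effect of layer shape and grouping on cost can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, biases are ignored, and the second-layer shape (256 filters of size 5x5) is taken from the original AlexNet description rather than from the text above.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial size of a CONV layer output along one dimension."""
    return (in_size + 2 * padding - filter_size) // stride + 1


def conv_weights(filter_size, in_channels, num_filters, groups=1):
    """Number of filter weights in a CONV layer (biases ignored).

    With grouped convolution, each filter only sees
    in_channels / groups of the input channels.
    """
    assert in_channels % groups == 0 and num_filters % groups == 0
    return filter_size * filter_size * (in_channels // groups) * num_filters


def conv_macs(out_size, filter_size, in_channels, num_filters, groups=1):
    """MACs per image: every weight is used once per output position."""
    weights = conv_weights(filter_size, in_channels, num_filters, groups)
    return out_size * out_size * weights


# First CONV layer: 227x227x3 input, 96 filters of 11x11x3, stride 4.
out1 = conv_output_size(227, 11, stride=4)
print(out1)                               # 55
print(conv_weights(11, 3, 96))            # 34848
print(conv_macs(out1, 11, 3, 96))         # 105415200

# Second CONV layer (assumed 256 filters of 5x5) with and without grouping.
print(conv_weights(5, 96, 256))           # 614400
print(conv_weights(5, 96, 256, groups=2)) # 307200
```

Splitting the 96 input channels into two groups halves the weights of the second layer, and, since each weight is applied once per output position, it halves the MACs as well.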

Overfeat [72] has a very similar architecture to AlexNet with five CONV layers followed 
by three FC layers. The main differences are that the number of filters is increased for layers 3 
(384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first 
FC layer only has 3072 channels rather than 4096, and the input size is 231x231 rather than 
227x227. As a result, the number of weights grows to 146M and the number of MACs grows 
to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The 
accurate model used in the ImageNet Challenge gives a 0.65% lower Top-5 error rate than the 
fast model at the cost of 1.9x more MACs. 

VGG-16 [73] goes deeper to 16 layers consisting of 13 CONV layers followed by 3 FC 
layers. In order to balance out the cost of going deeper, larger filters (e.g., 5x5) are built from 
multiple smaller filters (e.g., 3x3), which have fewer weights, to achieve the same effective re- 
ceptive fields, as shown in Figure 2.9a. As a result, all CONV layers have the same filter size of 
3x3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224x224 input 
image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives 
a 0.1% lower Top-5 error rate than VGG-16 at the cost of 1.27x more MACs. 
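The trade-off behind stacking small filters can be verified numerically. The following sketch assumes stride-1 layers and the same number of channels in every intermediate feature map; the function names are hypothetical.

```python
def stacked_receptive_field(filter_sizes):
    """Effective receptive field of stride-1 CONV layers applied in sequence."""
    field = 1
    for size in filter_sizes:
        field += size - 1
    return field


def weights_per_channel_pair(filter_sizes):
    """Weights needed per (input channel, output channel) pair."""
    return sum(size * size for size in filter_sizes)


# One 5x5 filter versus two stacked 3x3 filters.
print(stacked_receptive_field([5]), weights_per_channel_pair([5]))        # 5 25
print(stacked_receptive_field([3, 3]), weights_per_channel_pair([3, 3]))  # 5 18

# One 7x7 filter versus three stacked 3x3 filters.
print(stacked_receptive_field([7]), weights_per_channel_pair([7]))              # 7 49
print(stacked_receptive_field([3, 3, 3]), weights_per_channel_pair([3, 3, 3]))  # 7 27
```

The stacked version covers the same receptive field with fewer weights, at the price of additional layers (and intermediate activations) to process.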

GoogLeNet [74] goes even deeper with 22 layers. It introduced an inception module, 
shown in Figure 2.10, whose input is distributed through multiple feed-forward connections 
to several parallel layers. These parallel layers contain different sized filters (i.e., 1x1, 3x3, 5x5), 
along with 3x3 max-pooling, and their outputs are concatenated for the module output. Using 
multiple filter sizes has the effect of processing the input at multiple scales. For improved train- 


14This grouped convolution approach is applied more aggressively when performing co-design of algorithms and hardware
to reduce complexity, which will be discussed in Chapter 9.

Figure 2.8: An example of dividing a feature map into two grouped convolutions. Each filter requires
2x fewer weights and multiplications.

(a) Constructing a 5x5 support from 3x3 filters. Used in VGG-16.

(b) Constructing a 5x5 support from 1x5 and 5x1 filters. Used in GoogLeNet/Inception v3 and v4.





Figure 2.9: Decomposing larger filters into smaller filters.

Figure 2.10: Inception module from GoogLeNet [74] with example channel lengths. Note that
each CONV layer is followed by a ReLU (not drawn). 


ing speed, GoogLeNet is designed such that the weights and the activations, which are stored 
for backpropagation during training, could all fit into the GPU memory. In order to reduce the 
number of weights, 1x1 filters are applied as a “bottleneck” to reduce the number of channels for 
each filter [75], as shown in Figure 2.11. The 22 layers consist of three CONV layers, followed
by nine inception modules (each of which is two CONV layers deep), and one FC layer. The
number of FC layers was reduced from three to one using a global average pooling layer, which
summarizes the large feature map from the CONV layers into one value per channel; global pooling will
be discussed in more detail in Section 9.1.2. Since its introduction in 2014, GoogLeNet (also
referred to as Inception) has had multiple versions: v1 (described here), v3,¹⁵ and v4. Inception-v3
decomposes the convolutions by using smaller 1-D filters, as shown in Figure 2.9b, to reduce the
number of MACs and weights in order to go deeper to 42 layers. In conjunction with batch
normalization [66], v3 achieves over 3% lower Top-5 error than v1 with 2.5x more MACs [76]. 
Inception-v4 uses residual connections [77], described in the next section, for a 0.4% reduction 
in error. 
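The savings from the 1x1 bottleneck and from the 1-D decomposition can be illustrated with a small weight count. This is a sketch under assumed layer shapes: the channel counts (256, 64) are hypothetical and are not taken from GoogLeNet itself.

```python
def conv_weights(height, width, in_channels, num_filters):
    """Number of weights in a CONV layer with height x width filters."""
    return height * width * in_channels * num_filters


def bottleneck_weights(in_channels, reduced_channels, num_filters, size=3):
    """1x1 reduction to reduced_channels followed by a size x size CONV layer."""
    reduce = conv_weights(1, 1, in_channels, reduced_channels)
    spatial = conv_weights(size, size, reduced_channels, num_filters)
    return reduce + spatial


def separable_1d_weights(size, channels):
    """A 1 x size layer followed by a size x 1 layer (as in Figure 2.9b)."""
    return conv_weights(1, size, channels, channels) + conv_weights(size, 1, channels, channels)


# Hypothetical shapes: 256 input channels, 256 output filters, reduced to 64.
print(conv_weights(3, 3, 256, 256))       # 589824 (direct 3x3)
print(bottleneck_weights(256, 64, 256))   # 163840 (1x1 bottleneck, then 3x3)

# 5x5 filters versus 1x5 followed by 5x1, with 64 channels throughout.
print(conv_weights(5, 5, 64, 64))         # 102400
print(separable_1d_weights(5, 64))        # 40960
```

In both cases the reduction comes from shrinking one factor of the weight count: the bottleneck shrinks the number of channels seen by the spatial filter, while the 1-D decomposition shrinks the filter area from 25 to 10 weights per channel pair.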

ResNet [24], also known as Residual Net, uses feed-forward connections that connect to
layers beyond the immediate next layer (often referred to as residual, skip or identity connections); 
these connections enable a DNN with many layers (e.g., 34 or more) to be trainable. It was 


15v2 is very similar to v3.

Figure 2.11: Apply 1x1xC filter (usually referred to as 1x1) to capture cross-channel correlation,
but no spatial correlation. This bottleneck approach reduces the number of channels in the next layer,
assuming the number of filters applied (M) is less than the original number of channels (C).

the first CNN entry in the ImageNet Challenge that exceeded human-level accuracy with a Top-5
error rate below 5%. One of the challenges with deep networks is the vanishing gradient
during training [78]; as the error backpropagates through the network, the gradient shrinks,
which affects the ability to update the weights in the earlier layers for very deep networks. ResNet
introduces a “shortcut” module, which contains an identity connection such that the weight layers
(i.e., CONV layers) can be skipped, as shown in Figure 2.12. Rather than learning the underlying
function H(x) directly, the weight layers in the shortcut module learn the residual mapping F(x) = H(x) − x.
Initially, F(x) is zero and the identity connection is taken; then gradually during training, the
actual forward connection through the weight layer is used. ResNet also uses the “bottleneck”
approach of using 1x1 filters to reduce the number of weights. As a result, the two layers in
the shortcut module are replaced by three layers (1x1, 3x3, 1x1), where the first 1x1 layer
reduces the number of activations and thus weights in the 3x3 layer, and the last 1x1 layer restores

Figure 2.12: Shortcut module from ResNet [24]. Note that the ReLU following the last CONV layer
in the shortcut module is applied after the addition.


the number of activations in the output of the third layer. ResNet-50 consists of one CONV 
layer, followed by 16 shortcut layers (each of which is 3 CONV layers deep), and 1 FC layer;
it requires 25.5M weights and 3.9G MACs per image. There are various versions of ResNet
with multiple depths (e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152). The ResNet
with 152 layers was the winner of the ImageNet Challenge, requiring 11.3G MACs and 60M
weights. Compared to ResNet-50, it reduces the Top-5 error by around 1% at the cost of 2.9x
more MACs and 2.5x more weights. 
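The behavior of the shortcut module can be sketched in a few lines. To keep the example self-contained, the weight layers are modeled as small fully connected layers rather than CONV layers; this substitution, the function names, and the toy input are all assumptions made for illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(weights, values):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def shortcut_module(x, w1, w2):
    """Compute y = ReLU(F(x) + x), where F(x) = w2 * ReLU(w1 * x).

    The final ReLU is applied after the addition, as in Figure 2.12.
    """
    residual = matvec(w2, relu(matvec(w1, x)))
    return relu([r + xi for r, xi in zip(residual, x)])


def resnet_depth(num_shortcut_modules, layers_per_module):
    """One initial CONV layer + shortcut modules + one FC layer."""
    return 1 + num_shortcut_modules * layers_per_module + 1


x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

# With F(x) = 0 the module passes the input through the identity connection.
print(shortcut_module(x, zeros, zeros))  # [1.0, 2.0, 0.5]

# 16 bottleneck modules of 3 CONV layers each give the 50 layers of ResNet-50.
print(resnet_depth(16, 3))               # 50
```

Because the identity path bypasses the weight layers, the gradient also has a direct route back to earlier layers, which is what makes very deep networks trainable.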

Several trends can be observed in the popular CNNs shown in Table 2.2. Increasing the 
depth of the network tends to provide higher accuracy. Controlling for number of weights, a 
deeper network can support a wider range of nonlinear functions that are more discriminative 
and also provides more levels of hierarchy in the learned representation [24, 73, 74, 79]. The 
number of filter shapes continues to vary across layers, thus flexibility is still important. Fur- 
thermore, most of the computation has been placed on CONV layers rather than FC layers. In 
addition, the number of weights in the FC layers is reduced and in most recent networks (since 
GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware
implementations targeted at CNNs should be on addressing the efficiency of the CONV layers, 
which in many domains are increasingly important. 

Since ResNet, there have been several other notable networks that have been proposed to 
increase accuracy. DenseNet [84] extends the concept of skip connections by adding skip connections

Figure 2.13: Auto Encoder network for semantic segmentation. Feature maps along with pooling
and upsampling layers are shown. (Figure adapted from [92].)


from multiple previous layers to strengthen feature map propagation and feature reuse.
This concept, commonly referred to as feature aggregation, continues to be widely explored. 
WideNet [85] proposes increasing the width (i.e., the number of filters) rather than depth of 
network, which has the added benefit that increasing width is more parallel-friendly than in- 
creasing depth. ResNeXt [86] proposes increasing the number of convolution groups (referred to 
as cardinality) instead of depth and width of network and was used as part of the winning entry 
for ImageNet in 2017. Finally, EfficientNet [87] proposes uniformly scaling all dimensions in- 
cluding depth, width, and resolution rather than focusing on a single dimension since there is an 
interplay between the different dimensions (e.g., to support higher input image resolution, the 
DNN needs higher depth to increase the receptive field and higher width to capture more fine-
grained patterns). WideNet, ResNeXt, and EfficientNet demonstrate that there exist methods
beyond increasing depth for increasing accuracy, and thus highlight that there remains much to
be explored and understood about the relationship between layer shape, number of layers, and 
accuracy. 


2.5 OTHER DNNs


There are other types of DNNs beyond CNNs including Recurrent Neural Networks
(RNNs) [88, 89], Transformers [68], Auto Encoders (AEs) [90], and Generative Adversarial
Networks (GANs) [91]. The diverse types of DNNs allow them to handle a wide range of inputs for
a wide range of tasks. For instance, RNNs and Transformers are often used to handle sequential 
data that can have variable length (e.g., audio for speech recognition, or text for natural language 
processing). AEs and GANs can be used to generate dense output predictions by combining en- 
coder and decoder networks. Example applications that use AEs include predicting pixel-wise 
depth values for depth estimation [64] and assigning pixel-wise class labels for semantic segmen- 
tation [92], as shown in Figure 2.13. Example applications that use GANs to generate images 
with the same statistics as the training set include image synthesis [93] and style transfer [94].

Figure 2.14: Dependencies in an RNN exist in both the time and depth dimensions. The same
weights (W_i) are used across time, while different weights are used across depth. (Figure adapted
from [4].)


While their applications may differ from the CNNs described in Section 2.4, many of the 
building blocks and primitive layers are similar. For instance, RNNs and transformers heavily rely 
on matrix multiplications, which means that they have similar challenges as FC layers (e.g., they 
are memory bound due to lack of data reuse); thus, many of the techniques used to accelerate FC 
layers can also be used to accelerate RNNs and transformers (e.g., tiling discussed in Chapter 4, 
network pruning discussed in Chapter 8, etc.). Similarly, the decoder networks of GANs and AEs
for image processing use up-convolution layers, which involve upsampling the input feature map
using zero insertion (unpooling) before applying a convolution; thus, many of the techniques
used to accelerate CONV layers can also be used to accelerate the decoder network of GANs 
and AEs for image processing (e.g., exploit input activation sparsity discussed in Chapter 8). 
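The up-convolution described above can be sketched as zero insertion followed by an ordinary convolution. The code below is a minimal illustration in plain Python; the upsampling factor, the interpolation-style kernel, and the function names are assumptions chosen to keep the example small.

```python
def zero_insert(fmap, factor=2):
    """Upsample by placing each activation on a coarser grid, zeros elsewhere."""
    height, width = len(fmap), len(fmap[0])
    up = [[0.0] * (width * factor) for _ in range(height * factor)]
    for r in range(height):
        for c in range(width):
            up[r * factor][c * factor] = fmap[r][c]
    return up


def conv2d_same(fmap, kernel):
    """Stride-1 2-D convolution with zero padding (output size = input size)."""
    height, width = len(fmap), len(fmap[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += fmap[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


fmap = [[1.0, 2.0],
        [3.0, 4.0]]
upsampled = zero_insert(fmap)  # 4x4 feature map

# Symmetric kernel that interpolates between the inserted zeros.
kernel = [[0.25, 0.5, 0.25],
          [0.5, 1.0, 0.5],
          [0.25, 0.5, 0.25]]
output = conv2d_same(upsampled, kernel)

zeros = sum(row.count(0.0) for row in upsampled)
print(zeros, "of", 16, "inputs are zero")  # 12 of 16 inputs are zero
print(output[0][0], output[0][1], output[1][1])  # 1.0 1.5 2.5
```

With a factor of 2, three quarters of the inputs to the convolution are zero, which is why techniques that exploit input activation sparsity apply naturally to these layers.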

While the dominant compute aspect of these DNNs is similar to that of CNNs, they often
require some other forms of compute. For instance, RNNs, particularly Long Short-Term
Memory networks (LSTMs) [95], require support for element-wise multiplications as well as a
variety of nonlinear functions (sigmoid, tanh), unlike CNNs, which typically only use ReLU.
However, these operations do not tend to dominate run-time or energy consumption; they can 
be computed in software [96] or the nonlinear functions can be approximated by piecewise linear 
look up tables [97]. For GANs and AEs, additional support is required for upsampling. 
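As an illustration of the look-up table approach, the sketch below approximates the sigmoid with a piecewise linear table and derives tanh from it. The table range, the number of segments, and the function names are assumptions for this example and are not taken from [97].

```python
import math


def build_sigmoid_table(lo=-8.0, hi=8.0, segments=32):
    """Sample the sigmoid at evenly spaced breakpoints between lo and hi."""
    step = (hi - lo) / segments
    xs = [lo + k * step for k in range(segments + 1)]
    ys = [1.0 / (1.0 + math.exp(-x)) for x in xs]
    return xs, ys


def sigmoid_pwl(x, table):
    """Piecewise linear sigmoid: interpolate between neighboring table entries."""
    xs, ys = table
    if x <= xs[0]:
        return ys[0]   # saturate below the table range
    if x >= xs[-1]:
        return ys[-1]  # saturate above the table range
    step = xs[1] - xs[0]
    k = min(int((x - xs[0]) / step), len(xs) - 2)
    frac = (x - xs[k]) / step
    return ys[k] + frac * (ys[k + 1] - ys[k])


def tanh_pwl(x, table):
    """tanh(x) = 2 * sigmoid(2x) - 1, so the same table can be reused."""
    return 2.0 * sigmoid_pwl(2.0 * x, table) - 1.0


table = build_sigmoid_table()
worst = max(abs(sigmoid_pwl(i / 100.0, table) - 1.0 / (1.0 + math.exp(-i / 100.0)))
            for i in range(-1000, 1001))
print(sigmoid_pwl(0.0, table))  # 0.5
print(worst < 0.005)            # True
```

A 33-entry table already keeps the error of this particular approximation below 0.005 over the tested range; whether that is sufficient depends on the accuracy requirements of the DNN model.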

Finally, RNNs have additional dependencies since the output of a layer is fed back to its 
input, as shown in Figure 2.14. For instance, the inputs to layer i at time t depend on the
output of layer i − 1 at time t and layer i at time t − 1. This is similar to the dependency across
layers, in that the output of layer i is the input to layer i + 1. These dependencies limit what
inputs can be processed in parallel (e.g., within the same batch). For DNNs with feed-forward
layers, any inputs can be processed at the same time (i.e., batch size greater than one); however,
multiple layers of the same input cannot be processed at the same time (e.g., layers i and i + 1). In
contrast, RNNs can only process multiple inputs at the same time if the inputs are not sequentially
dependent; in other words, RNNs can process two separate sequences at the same time, but not
multiple elements within the sequence (e.g., inputs t and t + 1 of the same sequence) and not
multiple layers of the same input (which is similar to feed-forward networks).
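These dependencies can be made concrete with a toy multi-layer RNN. The sketch below uses scalar hidden states and hypothetical per-layer weights purely for illustration; the point is which loops must run in order and which could run in parallel.

```python
import math


def rnn_forward(sequences, layer_weights):
    """Run a toy multi-layer RNN with scalar hidden states.

    sequences: batch of independent, equal-length input sequences.
    layer_weights: one (w_in, w_rec) pair per layer; the same pair is
        reused at every time step, different layers use different pairs.
    Returns h[layer][t][b].
    """
    batch, steps, layers = len(sequences), len(sequences[0]), len(layer_weights)
    h = [[[0.0] * batch for _ in range(steps)] for _ in range(layers)]
    for t in range(steps):       # sequential: time t needs time t - 1
        for i in range(layers):  # sequential: layer i needs layer i - 1
            w_in, w_rec = layer_weights[i]
            for b in range(batch):  # independent: could run in parallel
                below = sequences[b][t] if i == 0 else h[i - 1][t][b]
                previous = h[i][t - 1][b] if t > 0 else 0.0
                h[i][t][b] = math.tanh(w_in * below + w_rec * previous)
    return h


weights = [(0.5, 0.3), (0.8, -0.2)]
seq_a = [1.0, 0.5, -1.0]
seq_b = [0.2, 0.4, 0.6]

batched = rnn_forward([seq_a, seq_b], weights)
alone = rnn_forward([seq_a], weights)

# Processing seq_a in a batch with seq_b gives the same result as alone.
print(batched[1][2][0] == alone[1][2][0])  # True
```

Only the innermost loop over separate sequences is free of dependencies; the loops over time steps and layers cannot be reordered without violating the dependencies shown in Figure 2.14.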


2.6 DNN DEVELOPMENT RESOURCES 
One of the key factors that has enabled the rapid development of DNNs is the set of development
resources that have been made available by the research community and industry. These
resources are also key to the development of DNN accelerators by providing characterizations of 
the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. 
This section will describe these resources such that those who are interested in this field can 
quickly get started. 


2.6.1 FRAMEWORKS 


For ease of DNN development and to enable the sharing of trained networks, several deep learning
frameworks have been developed from various sources. These open-source frameworks provide
software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [59]. It supports
C, C++, Python, and MATLAB. TensorFlow [98] was released by Google in 2015, and
supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility
than Caffe, with the computation expressed as dataflow graphs to manage the “tensors” (multidimensional
arrays). Another popular framework is Torch, which was developed by Facebook and
NYU and supports C, C++, and Lua; PyTorch [99] is its successor and is built in Python. There
are several other frameworks such as Theano, MXNet, and CNTK, which are described in [100].
There are also higher-level libraries that can run on top of the aforementioned frameworks to
provide a more universal experience and faster development. One example of such libraries is
Keras, which is written in Python and supports TensorFlow, CNTK, and Theano.

Such frameworks are not only a convenient aid for DNN researchers and
application designers, but they are also invaluable for engineering high performance or more
efficient DNN computation engines. In particular, because the frameworks make heavy use of
a set of primitive operations, such as the processing of a CONV layer, they can incorporate
use of optimized software or hardware accelerators. This acceleration is transparent to the user
of the framework. Thus, for example, most frameworks can use Nvidia's cuDNN library for
rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware
accelerators can be achieved, as was done with the Eyeriss chip using Caffe [101].

Finally, these frameworks are a valuable source of workloads for hardware researchers. 
They can be used to drive experimental designs for different workloads, for profiling different
workloads, and for exploring hardware-algorithm trade-offs.

Figure 2.15: MNIST (10 classes, 60k training, 10k testing) [103] versus ImageNet (1000 classes,
1.3M training, 100k testing) [23] dataset.


2.6.2 MODELS 


Pre-trained DNN models can be downloaded from various websites [80-83] for the different
frameworks. It should be noted that even for the same DNN (e.g., AlexNet) the accuracy
of these models can vary by around 1 to 2% depending on how the model was trained and tested, 
and thus the results do not always exactly match the original publication. 

These pre-trained models are often tied to a given framework. In order to facilitate easier
exchange between different networks, Open Neural Network Exchange (ONNX) has been es- 
tablished as an open ecosystem for interchangeable DNN models [102]; the current participants 
include Amazon, Facebook, and Microsoft. 


2.6.3 POPULAR DATASETS FOR CLASSIFICATION 


It is important to factor in the difficulty of the task when comparing different DNN mod- 
els. For instance, the task of classifying handwritten digits from the MNIST dataset [103] is 
much simpler than classifying an object into one of 1000 classes as is required for the ImageNet 
dataset [23] (Figure 2.15). It is expected that the size of the DNNs (i.e., number of weights) and 
the number of MACs will be larger for the more difficult task than the simpler task and thus 
require more energy and have lower throughput. For instance, LeNet-5 [71] is designed for digit
classification, while AlexNet [7], VGG-16 [73], GoogLeNet [74], and ResNet [24] are designed
for the 1000-class image classification. 

There are many AI tasks that come with publicly available datasets in order to evaluate the
accuracy of a given DNN. Public datasets are important for comparing the accuracy of different 
approaches. The simplest and most common task in computer vision is image classification, 
which involves being given an entire image, and selecting 1 of N classes that the image most 
likely belongs to. There is no localization or detection. 

MNIST is a widely used dataset for digit classification that was introduced in 1998 [103]. 
It consists of 28x28 pixel grayscale images of handwritten digits. There are 10 classes (for 10
digits), with 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an
accuracy of 99.05% when MNIST was first introduced. Since then the accuracy has increased to 
99.79% using regularization of neural networks with dropconnect [104]. Thus, MNIST is now 
considered a fairly easy dataset. 

CIFAR is a dataset that consists of 32x32 pixel colored images of various objects, which 
was released in 2009 [105]. CIFAR is a subset of the 80 million Tiny Image dataset [106]. 
CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 
per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network 
was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [107]. Since 
then the accuracy has increased to 96.53% using fractional max pooling [108]. 

ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset
stabilized in 2012 [23]. It contains 256x256 pixel color images with 1000 classes. The classes
are defined using WordNet as a backbone to handle ambiguous word meanings and to
combine synonyms into the same object category. In other words, there is a hierarchy for
the ImageNet categories. The 1000 classes were selected such that there is no overlap in the Im- 
ageNet hierarchy. The ImageNet dataset contains many fine-grained categories including 120 
different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing 
images (100 per class) and 50,000 validation images (50 per class). 

The accuracy for the image classification task in the ImageNet Challenge is reported
using two metrics: Top-5 and Top-1 accuracy.¹⁶ Top-5 accuracy means that if any of the top five
scoring categories is the correct category, it is counted as a correct classification. Top-1 accuracy
requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge
(AlexNet) was able to achieve an accuracy of 83.6% for the Top-5 (substantially better
than the 73.8% achieved by the second-place entry that year, which did not use DNNs); it achieved 61.9%
on the Top-1 of the validation set. In 2019, the state-of-the-art DNNs achieve accuracy above
97% for the Top-5 and above 84% for the Top-1 [87].
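The Top-k metrics defined above can be computed directly from the per-class scores. The sketch below uses a toy example with six classes and k = 3 instead of 1000 classes and k = 5; the scores, labels, and function names are made up for illustration.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of images whose correct class is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return 100.0 * correct / len(labels)


def top_k_error(scores, labels, k):
    """Error is 100% minus accuracy."""
    return 100.0 - top_k_accuracy(scores, labels, k)


# One row of class scores per image (toy example with 6 classes).
scores = [
    [0.10, 0.60, 0.05, 0.05, 0.10, 0.10],  # label 1: ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # label 2: ranked second
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],  # label 0: outside the top three
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],  # label 0: ranked first
]
labels = [1, 2, 0, 0]

print(top_k_accuracy(scores, labels, k=1))  # 50.0
print(top_k_accuracy(scores, labels, k=3))  # 75.0
print(top_k_error(scores, labels, k=3))     # 25.0
```

Top-k accuracy can never be lower than Top-1 accuracy on the same predictions, which is why the Top-5 figures quoted above are consistently higher than the Top-1 figures.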

In summary of the various image classification datasets, it is clear that MNIST is a fairly 
easy dataset, while ImageNet is a more challenging one with a wider coverage of classes. Thus, 
in terms of evaluating the accuracy of a given DNN, it is important to consider the dataset upon
which the accuracy is measured. 


2.6.4 DATASETS FOR OTHER TASKS 


Since state-of-the-art DNNs now perform better than human-level accuracy
on image classification tasks, the ImageNet Challenge has started to focus on more difficult
tasks such as single-object localization and object detection. For single-object localization, the 
target object must be localized and classified (out of 1000 classes). The DNN outputs the top 
five categories and top five bounding box locations. There is no penalty for identifying an object 
that is in the image but not included in the ground truth. For object detection, all objects in the 


16Note that in some parts of the book we use Top-1 and Top-5 error. The error can be computed as 100% minus accuracy.

image must be localized and classified (out of 200 classes). The bounding box for all objects in
these categories must be labeled. Objects that are not labeled are penalized, as are duplicate
detections.

Beyond ImageNet, there are also other popular image datasets for computer vision tasks. 
For object detection, there is the PASCAL VOC (2005-2012) dataset that contains 11k images 
representing 20 classes (27k object instances, 7k of which have detailed segmentation) [109]. 
For object detection, segmentation, and recognition in context, there is the MS COCO dataset
with 2.5M labeled instances in 328k images (91 object categories) [110]; compared to ImageNet,
COCO has fewer categories but more instances per category, which is useful for precise 2-D
localization. COCO also has more labeled instances per image to potentially help with contextual
information.

Most recently, even larger scale datasets have been made available. For instance, Google 
has an Open Images dataset with over 9M images [111], spanning 6000 categories. There is also 
a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [112]. Google 
also released an audio dataset comprising 632 audio event classes and a collection of 2M
human-labeled 10-second sound clips [113]. These large datasets will be ever more important as
DNNs become deeper with more weights to train. In addition, it has been shown that accuracy
increases logarithmically with the amount of training data [16].¹⁷

Undoubtedly, both larger datasets and datasets for new domains will serve as important 
resources for profiling and exploring the efficiency of future DNN engines. 


2.6.5 SUMMARY 


The development resources presented in this section enable us to evaluate hardware using the 
appropriate DNN model and dataset. In particular, it’s important to realize that difficult tasks 
typically require larger models; for instance, LeNet would not apply to the ImageNet Challenge. 
In addition, different datasets are required for different tasks; for instance, self-driving cars re- 
quire high-definition video, and thus a network trained on the low resolution ImageNet dataset 
may not be sufficient. To address these requirements, the number of datasets continues to grow 
at a rapid pace. 


17This was demonstrated on Google’s internal JFT-300M dataset with 300M images and 18,291 classes, which is two 
orders of magnitude larger than ImageNet. However, performing four iterations across the entire training set using 50 K- 
80 GPUs required two months of training, which further emphasizes that compute is one of the main bottlenecks in the 
advancement of DNN research.

CHAPTER 3


Key Metrics and Design 
Objectives 


Over the past few years, there has been a significant amount of research on efficient process- 
ing of DNNs. Accordingly, it is important to discuss the key metrics that one should consider 
when comparing and evaluating the strengths and weaknesses of different designs and proposed 
techniques and that should be incorporated into design considerations. While efficiency is often 
only associated with the number of operations per second per Watt (e.g., floating-point opera- 
tions per second per Watt as FLOPS/W or tera-operations per second per Watt as TOPS/W), 
it is actually composed of many more metrics including accuracy, throughput, latency, energy 
consumption, power consumption, cost, flexibility, and scalability. Reporting a comprehensive 
set of these metrics is important in order to provide a complete picture of the trade-offs made 
by a proposed design or technique. 
In this chapter, we will 


• discuss the importance of each of these metrics;


• break down the factors that affect each metric and, when feasible, present equations that
describe the relationship between the factors and the metrics;


• describe how these metrics can be incorporated into design considerations for both the
DNN hardware and the DNN model (i.e., workload); and


• specify what should be reported for a given metric to enable proper evaluation.


Finally, we will provide a case study on how one might bring all these metrics together for a 
holistic evaluation of a given approach. But first, we will discuss each of the metrics. 


3.1 ACCURACY 


Accuracy is used to indicate the quality of the result for a given task. The fact that DNNs can 
achieve state-of-the-art accuracy on a wide range of tasks is one of the key reasons driving the 
popularity and wide use of DNNs today. The units used to measure accuracy depend on the 
task. For instance, for image classification, accuracy is reported as the percentage of correctly 
classified images, while for object detection, accuracy is reported as the mean average precision 
(mAP), which is related to the trade-off between the true positive rate and false positive rate.

Factors that affect accuracy include the difficulty of the task and dataset. For instance,
classification on ImageNet is much more difficult than on MNIST, and object detection or 
semantic segmentation is more difficult than classification. As a result, a DNN model that per- 
forms well on MNIST may not necessarily perform well on ImageNet. 

Achieving high accuracy on difficult tasks or datasets typically requires more complex 
DNN models (e.g., a larger number of MAC operations and more distinct weights, increased 
diversity in layer shapes, etc.), which can impact how efficiently the hardware can process the 
DNN model. 

Accuracy should therefore be interpreted in the context of the difficulty of the task and 
dataset.² Evaluating hardware using well-studied, widely used DNN models, tasks, and datasets
can allow one to better interpret the significance of the accuracy metric. Recently, motivated by 
the impact of the SPEC benchmarks for general purpose computing [114], several industry 
and academic organizations have put together a broad suite of DNN models, called MLPerf, to 
serve as a common set of well-studied DNN models to evaluate the performance and enable fair 
comparison of various software frameworks, hardware accelerators, and cloud platforms for both 
training and inference of DNNs [115].³ The suite includes various types of DNNs (e.g., CNN,
RNN, etc.) for a variety of tasks including image classification, object identification, translation, 
speech-to-text, recommendation, sentiment analysis, and reinforcement learning. 


3.2 THROUGHPUT AND LATENCY 


Throughput is used to indicate the amount of data that can be processed or the number of exe- 
cutions of a task that can be completed in a given time period. High throughput is often critical 
to an application. For instance, processing video at 30 frames per second is necessary for deliv- 
ering real-time performance. For data analytics, high throughput means that more data can be 
analyzed in a given amount of time. As the amount of visual data is growing exponentially, high- 
throughput big data analytics becomes increasingly important, particularly if an action needs to 
be taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis or drug 
discovery). Throughput is often generically reported as the number of operations per second. In 
the case of inference, throughput is reported as inferences per second or in the form of runtime 
in terms of seconds per inference. 

Latency measures the time between when the input data arrives at a system and when the 
result is generated. Low latency is necessary for real-time interactive applications, such as aug- 
mented reality, autonomous navigation, and robotics. Latency is typically reported in seconds. 


1Ideally, robustness and fairness should be considered in conjunction with accuracy, as there is also an interplay between 
these factors; however, these are areas of on-going research and beyond the scope of this book. 

2As an analogy, getting 9 out of 10 answers correct on a high school exam is different than 9 out of 10 answers correct 
on a college-level exam. One must look beyond the score and consider the difficulty of the exam. 

3Earlier DNN benchmarking efforts including DeepBench [116] and Fathom [117] have now been subsumed by 
MLPerf. 

Throughput and latency are often assumed to be directly derivable from one another. 
However, they are actually quite distinct. A prime example of this is the well-known approach 
of batching input data (e.g., batching multiple images or frames together for processing) to 
increase throughput since it amortizes overhead, such as loading the weights; however, batching 
also increases latency (e.g., at 30 frames per second and a batch of 100 frames, some frames will 
experience a delay of at least 3.3 seconds), which is not acceptable for real-time applications, such 
as high-speed navigation where it would reduce the time available for course correction. Thus, 
achieving low latency and high throughput simultaneously can sometimes be at odds depending 
on the approach, and both should be reported.⁴ 
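
The batching trade-off described above can be made concrete with a small calculation. The following is a minimal Python sketch, not a model of any particular accelerator: the function names, the fixed per-batch weight-loading time, and the constant per-frame compute time are all illustrative assumptions.

```python
def batch_fill_delay_s(batch_size: int, frames_per_second: float) -> float:
    """Seconds the first frame of a batch waits for the last frame to arrive."""
    return (batch_size - 1) / frames_per_second


def batched_throughput_fps(batch_size: int, weight_load_s: float,
                           compute_s_per_frame: float) -> float:
    """Frames per second when one weight load is amortized over a whole batch.

    Assumes (hypothetically) a fixed weight-loading overhead per batch and a
    constant compute time per frame.
    """
    return batch_size / (weight_load_s + batch_size * compute_s_per_frame)


# Hypothetical numbers: 10 ms to load weights, 1 ms of compute per frame,
# and frames arriving at 30 frames per second.
for n in (1, 10, 100):
    print(n, batched_throughput_fps(n, 0.010, 0.001), batch_fill_delay_s(n, 30.0))
```

With these assumed numbers, throughput rises with batch size because the weight load is amortized, while the wait for a full batch of 100 frames grows to about 3.3 seconds, in line with the example in the text.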

There are several factors that affect throughput and latency. In terms of throughput, the 
number of inferences per second is affected by 





$$
\frac{\text{inferences}}{\text{second}} = \frac{\text{operations}}{\text{second}} \times \frac{1}{\dfrac{\text{operations}}{\text{inference}}}, \tag{3.1}
$$


where the number of operations per second is dictated by both the DNN hardware and DNN 
model, while the number of operations per inference is dictated by the DNN model. 

When considering a system comprised of multiple processing elements (PEs), where a PE 
corresponds to a simple or primitive core that performs a single MAC operation, the number of 
operations per second can be further decomposed as follows: 











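$$
\frac{\text{operations}}{\text{second}} = \underbrace{\left( \frac{1}{\dfrac{\text{cycles}}{\text{operation}}} \times \frac{\text{cycles}}{\text{second}} \right)}_{\text{for a single PE}} \times \text{number of PEs} \times \text{utilization of PEs}. \tag{3.2}
$$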


The first term reflects the peak throughput of a single PE, the second term reflects the amount 
of parallelism, while the last term reflects degradation due to the inability of the architecture to 
effectively utilize the PEs. 
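
As a worked illustration of Equations (3.1) and (3.2), the short Python sketch below multiplies the terms out. The clock rate, PE count, utilization, and operations per inference are hypothetical values chosen only for the example, and the function names are not from the text.

```python
def operations_per_second(cycles_per_operation: float, cycles_per_second: float,
                          num_pes: int, pe_utilization: float) -> float:
    """Equation (3.2): single-PE peak throughput x parallelism x utilization."""
    single_pe_peak = (1.0 / cycles_per_operation) * cycles_per_second
    return single_pe_peak * num_pes * pe_utilization


def inferences_per_second(ops_per_second: float, ops_per_inference: float) -> float:
    """Equation (3.1): operations per second divided by operations per inference."""
    return ops_per_second / ops_per_inference


# Hypothetical accelerator: 1 cycle per MAC, 200 MHz clock, 168 PEs, 80% utilization.
ops = operations_per_second(1.0, 200e6, 168, 0.80)
# Hypothetical DNN model that needs 700 million MACs per inference.
print(ops, inferences_per_second(ops, 700e6))
```

Under these assumptions the design sustains roughly 26.9 billion MACs per second, or about 38 inferences per second; halving the utilization would halve both numbers even though the peak throughput is unchanged.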

Since the main operation for processing DNNs is a MAC, we will use number of opera- 
tions and number of MAC operations interchangeably. 

One can increase the peak throughput of a single PE by increasing the number of cycles 
per second, which corresponds to a higher clock frequency, by reducing the critical path at the 


4The phenomenon described here can also be understood using Little’s Law [118] from queuing theory, where 
average throughput and average latency are related by the average number of tasks in flight, as defined by 

$$
\overline{\text{throughput}} = \frac{\overline{\text{tasks-in-flight}}}{\overline{\text{latency}}}.
$$

A DNN-centric version of Little’s Law would have throughput measured in inferences per second, latency measured in sec- 
onds, and inferences-in-flight, as the tasks-in-flight equivalent, measured in the number of images in a batch being processed 
simultaneously. This helps to explain why increasing the number of inferences in flight to increase throughput may be coun- 
terproductive because some techniques that increase the number of inferences in flight (e.g., batching) also increase latency. 
circuit or micro-architectural level, or by reducing the number of cycles per operation, which can be affected 
by the design of the MAC (e.g., a non-pipelined multi-cycle MAC would have more cycles per 
operation). 

While the above approaches increase the throughput of a single PE, the overall throughput 
can be increased by increasing the number of PEs, and thus the maximum number of MAC 
operations that can be performed in parallel. The number of PEs is dictated by the area density 
of the PE and the area cost of the system. If the area cost of the system is fixed, then increasing 
the number of PEs requires either increasing the area density of the PE (i.e., reducing the area per 
PE) or trading off on-chip storage area for more PEs. Reducing on-chip storage, however, can 
affect the utilization of the PEs, which we will discuss next. 

Increasing the density of PEs can also be achieved by reducing the logic associated with 
delivering operands to a MAC. This can be achieved by controlling multiple MACs with a 
single piece of logic. This is analogous to the situation in instruction-based systems such as 
CPUs and GPUs that reduce instruction bookkeeping overhead by using large aggregate instruc- 
tions (e.g., single-instruction, multiple-data (SIMD)/Vector Instructions; single-instruction, 
multiple-threads (SIMT)/Tensor Instructions), where a single instruction can be used to ini- 
tiate multiple operations. 

The number of PEs and the peak throughput of a single PE only indicate the theoretical 
maximum throughput (i.e., peak performance) when all PEs are performing computation (100% 
utilization). In reality, the achievable throughput depends on the actual utilization of those PEs, 
which is affected by several factors as follows: 

$$
\text{utilization of PEs} = \frac{\text{number of active PEs}}{\text{number of PEs}} \times \text{utilization of active PEs}. \tag{3.3}
$$


The first term reflects the ability to distribute the workload to PEs, while the second term reflects 
how efficiently those active PEs are processing the workload. 

The number of active PEs is the number of PEs that receive work; therefore, it is desirable 
to distribute the workload to as many PEs as possible. The ability to distribute the workload is 
determined by the flexibility of the architecture, for instance the on-chip network, to support 
the layer shapes in the DNN model. 

Within the constraints of the on-chip network, the number of active PEs is also determined 
by the specific allocation of work to PEs by the mapping process. The mapping process involves 
the placement and scheduling in space and time of every MAC operation (including the delivery 
of the appropriate operands) onto the PEs. Mapping can be thought of as a compiler for the 
DNN hardware. The design of on-chip networks and mappings is discussed in Chapters 5 and 
6. 
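
To illustrate Equation (3.3) and how a mapping can leave PEs without work, the Python sketch below uses a deliberately naive mapping that is an assumption of this example, not a scheme from the text: independent work items are assigned to PEs in passes, so the final pass may be only partly filled.

```python
import math


def active_pe_fraction(work_items: int, num_pes: int) -> float:
    """Average fraction of PEs that receive work under a naive mapping.

    Assumes (hypothetically) that independent work items are assigned to PEs
    in passes of at most `num_pes` items, so the final pass may be partly empty.
    """
    passes = math.ceil(work_items / num_pes)
    return work_items / (passes * num_pes)


def utilization_of_pes(active_fraction: float, active_pe_utilization: float) -> float:
    """Equation (3.3): (active PEs / PEs) x utilization of active PEs."""
    return active_fraction * active_pe_utilization


# 96 work items on 64 PEs need two passes; the second pass uses only 32 PEs.
fraction = active_pe_fraction(96, 64)
print(fraction, utilization_of_pes(fraction, 0.8))
```

A layer shape that divides evenly across the array (e.g., 128 items on 64 PEs) keeps every PE active in this model, which is one way to see why layer shapes and architectural flexibility both affect the number of active PEs.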

The utilization of the active PEs is largely dictated by the timely delivery of work to the 
PEs such that the active PEs do not become idle while waiting for the data to arrive. This 
can be affected by the bandwidth and latency of the (on-chip and off-chip) memory and net- 
work. The bandwidth requirements can be affected by the amount of data reuse available in the 

Figure 3.1: The roofline model. The peak operations per second is indicated by the bold line; 
when the operation intensity, which is dictated by the amount of compute per byte of data, is low, 
the operations per second is limited by the data delivery. The design goal is to operate as close as 
possible to the peak operations per second for the operation intensity of a given workload. 


DNN model and the amount of data reuse that can be exploited by the memory hierarchy and 
dataflow. The dataflow determines the order of operations and where data is stored and reused. 
The amount of data reuse can also be increased using a larger batch size, which is one of the 
reasons why increasing batch size can increase throughput. The challenges of data delivery and 
memory bandwidth are discussed in Chapters 5 and 6. The utilization of the active PEs can also 
be affected by the imbalance of work allocated across PEs, which can occur when exploiting 
sparsity (i.e., avoiding unnecessary work associated with multiplications by zero); PEs with less 
work become idle and thus have lower utilization. 

There is also an interplay between the number of PEs and the utilization of PEs. For 
instance, one way to reduce the likelihood that a PE needs to wait for data is to store some 
data locally near or within the PE. However, this requires increasing the chip area allocated to 
on-chip storage, which, given a fixed chip area, would reduce the number of PEs. Therefore, a 
key design consideration is how much area to allocate to compute (which increases the number 
of PEs) versus on-chip storage (which increases the utilization of PEs). 

The impact of these factors can be captured using Eyexam, which is a systematic way of 
understanding the performance limits for DNN processors as a function of specific character- 
istics of the DNN model and accelerator design. Eyexam includes and extends the well-known 
roofline model [119]. The roofline model, as illustrated in Figure 3.1, relates average bandwidth 
demand and peak computational ability to performance. Eyexam is described in Chapter 6. 
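
The basic roofline relationship can be sketched in a few lines of Python. This covers only the standard roofline bound (the minimum of peak compute and bandwidth times operation intensity), not Eyexam; the peak and bandwidth figures are hypothetical.

```python
def roofline_ops_per_second(peak_ops_per_second: float,
                            bandwidth_bytes_per_second: float,
                            ops_per_byte: float) -> float:
    """Attainable operations per second under the roofline model."""
    return min(peak_ops_per_second, bandwidth_bytes_per_second * ops_per_byte)


# Hypothetical hardware: 1e12 operations per second peak, 100 GB/s data delivery.
PEAK, BANDWIDTH = 1e12, 100e9
for intensity in (2.0, 10.0, 50.0):  # operation intensity in operations per byte
    print(intensity, roofline_ops_per_second(PEAK, BANDWIDTH, intensity))
```

For this assumed hardware the two limits meet at 10 operations per byte: workloads below that intensity are limited by data delivery, and workloads above it are limited by peak compute.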

While the number of operations per inference in Equation (3.1) depends on the DNN 
model, the operations per second depends on both the DNN model and the hardware. For example, 
designing DNN models with efficient layer shapes (also referred to as efficient network 
architectures), as described in Chapter 9, can reduce the number of MAC operations in the 
DNN model and consequently the number of operations per inference. However, such DNN 
models can result in a wide range of layer shapes, some of which may have poor utilization of 
PEs and therefore reduce the overall operations per second, as shown in Equation (3.2). 

A deeper consideration of operations per second is that not all operations are created 
equal and therefore cycles per operation may not be a constant. For example, if we consider the 
fact that anything multiplied by zero is zero, some MAC operations are ineffectual (i.e., they do 
not change the accumulated value). The number of ineffectual operations is a function of both 
the DNN model and the input data. These ineffectual MAC operations can require fewer cycles 
or no cycles at all. Conversely, we only need to process effectual (or non-zero) MAC operations, 
where both inputs are non-zero; this is referred to as exploiting sparsity, which is discussed in 
Chapter 8. 

Processing only effectual MAC operations can increase the (total) operations per second 
by increasing the (total) operations per cycle. Ideally, the hardware would skip all ineffectual 
operations; however, in practice, designing hardware to skip all ineffectual operations can be 
challenging and result in increased hardware complexity and overhead, as discussed in Chap- 
ter 8. For instance, it might be easier to design hardware that only recognizes zeros in one of the 
operands (e.g., weights) rather than both. Therefore, the ineffectual operations can be further 
divided into those that are exploited by the hardware (i.e., skipped) and those that are unex- 
ploited by the hardware (i.e., not skipped). The number of operations actually performed by the 
hardware is therefore effectual operations plus unexploited ineffectual operations. 

Equation (3.4) shows how operations per cycle can be decomposed into 


1. the number of effectual operations plus unexploited ineffectual operations per cycle, which re- 
mains somewhat constant for a given hardware accelerator design; 


2. the ratio of effectual operations over effectual operations plus unexploited ineffectual operations, 
which refers to the ability of the hardware to exploit ineffectual operations (ideally unex- 
ploited ineffectual operations should be zero, and this ratio should be one); and 


3. the number of effectual operations out of (total) operations, which is related to the amount of 
sparsity and depends on the DNN model. 


As the amount of sparsity increases (i.e., the number of effectual operations out of (total) operations 
decreases), the operations per cycle increases, which subsequently increases operations per second, 
as shown in Equation (3.2): 


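$$
\begin{aligned}
\frac{\text{operations}}{\text{cycle}} ={}& \frac{\text{effectual operations} + \text{unexploited ineffectual operations}}{\text{cycle}} \\
&\times \frac{\text{effectual operations}}{\text{effectual operations} + \text{unexploited ineffectual operations}} \times \frac{1}{\dfrac{\text{effectual operations}}{\text{operations}}}.
\end{aligned} \tag{3.4}
$$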


5By total operations we mean both effectual and ineffectual operations. 

Table 3.1: Classification of factors that affect inferences per second 

Factor | Hardware | DNN Model | Input Data 
Operations per inference 
Operations per cycle 
Cycles per second 
Number of PEs 
Number of active PEs 
Utilization of active PEs 
Effectual operations out of (total) operations 
Effectual operations plus unexploited ineffectual operations per cycle 


However, exploiting sparsity requires additional hardware to identify when inputs are zero to 
avoid performing unnecessary MAC operations. The additional hardware can increase the criti- 
cal path, which decreases cycles per second, and also reduce area density of the PE, which reduces 
the number of PEs for a given area. Both of these factors can reduce the operations per second, as 
shown in Equation (3.2). Therefore, the complexity of the additional hardware can result in a 
trade-off between reducing the number of unexploited ineffectual operations and increasing the critical 
path or reducing the number of PEs. 
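
The decomposition in Equation (3.4) and the trade-off with clock frequency can be checked numerically. The Python sketch below is illustrative only: the operation counts and the 10% clock-frequency penalty attributed to the zero-skipping logic are hypothetical assumptions.

```python
def operations_per_cycle(performed_ops_per_cycle: float, effectual_ops: float,
                         unexploited_ineffectual_ops: float, total_ops: float) -> float:
    """Equation (3.4): (total) operations per cycle, counting skipped operations."""
    exploit_ratio = effectual_ops / (effectual_ops + unexploited_ineffectual_ops)
    effectual_fraction = effectual_ops / total_ops
    return performed_ops_per_cycle * exploit_ratio / effectual_fraction


# Hypothetical layer: 1000 MACs in total, of which 400 are effectual.
dense = operations_per_cycle(1.0, 400, 600, 1000)   # no ineffectual MACs skipped
sparse = operations_per_cycle(1.0, 400, 200, 1000)  # 400 of the 600 are skipped

# Assume (hypothetically) the skipping logic lengthens the critical path so
# that cycles per second drops by 10%.
speedup = (sparse * 0.9) / (dense * 1.0)
print(dense, sparse, speedup)
```

In this example, skipping two thirds of the ineffectual operations raises operations per cycle from 1.0 to about 1.67, and the net gain in operations per second after the assumed clock penalty is 1.5x. A larger penalty or less sparsity would shrink that gain.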

Finally, designing hardware and DNN models that support reduced precision (i.e., fewer 
bits per operand and per operation), which is discussed in Chapter 7, can also increase the 
number of operations per second. Fewer bits per operand means that the memory bandwidth 
required to support a given operation is reduced, which can increase the utilization of PEs since 
they are less likely to be starved for data. In addition, the area of each PE can be reduced, which 
can increase the number of PEs for a given area. Both of these factors can increase the operations 
per second, as shown in Equation (3.2). Note, however, that if multiple levels of precision need to 
be supported, additional hardware is required, which can, once again, increase the critical path 
and also reduce area density of the PE, both of which can reduce the operations per second, as 
shown in Equation (3.2). 

In this section, we discussed multiple factors that affect the number of inferences per 
second. Table 3.1 classifies whether the factors are dictated by the hardware, by the DNN model 
or both. 

In summary, the number of MAC operations in the DNN model alone is not sufficient for 
evaluating the throughput and latency. While the DNN model can affect the number of MAC 
operations per inference based on the network architecture (i.e., layer shapes) and the sparsity 
of the weights and activations, the overall impact that the DNN model has on throughput and 

Figure 3.2: The number of MAC operations in various DNN models versus latency measured on 
a Pixel phone. Clearly, the number of MAC operations is not a good predictor of latency. (Figure 
from [120].) 


latency depends on the ability of the hardware to add support to recognize these approaches 
without significantly reducing utilization of PEs, number of PEs, or cycles per second. This is 
why the number of MAC operations is not necessarily a good proxy for throughput and latency 
(e.g., Figure 3.2), and it is often more effective to design efficient DNN models with hardware 
in the loop. Techniques for designing DNN models with hardware in the loop are discussed in 
Chapter 9. 

Similarly, the number of PEs in the hardware and their peak throughput are not sufficient 
for evaluating the throughput and latency. It is critical to report actual runtime of the DNN 
models on hardware to account for other effects such as utilization of PEs, as highlighted in 
Equation (3.2). Ideally, this evaluation should be performed on clearly specified DNN models, 
for instance those that are part of the MLPerf benchmarking suite. In addition, batch size should 
be reported in conjunction with the throughput in order to evaluate latency. 


3.3 ENERGY EFFICIENCY AND POWER CONSUMPTION 


Energy efficiency is used to indicate the amount of data that can be processed or the number of 
executions of a task that can be completed for a given unit of energy. High energy efficiency is 
important when processing DNNs at the edge in embedded devices with limited battery capacity 
(e.g., smartphones, smart sensors, robots, and wearables). Edge processing may be preferred 
over the cloud for certain applications due to latency, privacy, or communication bandwidth 
limitations. Energy efficiency is often generically reported as the number of operations per joule. 
In the case of inference, energy efficiency is reported as inferences per joule or the inverse as 
energy consumption in terms of joules per inference. 

Power consumption is used to indicate the amount of energy consumed per unit time. In- 
creased power consumption results in increased heat dissipation; accordingly, the maximum 
power consumption is dictated by a design criterion typically called the thermal design power 
(TDP), which is the power that the cooling system is designed to dissipate. Power consumption 
is important when processing DNNs in the cloud as data centers have stringent power ceilings 
due to cooling costs; similarly, handheld and wearable devices also have tight power constraints 
since the user is often quite sensitive to heat and the form factor of the device limits the cool- 
ing mechanisms (e.g., no fans). Power consumption is typically reported in watts or joules per 
second. 

Power consumption in conjunction with energy efficiency limits the throughput as follows: 





$$
\frac{\text{inferences}}{\text{second}} \leq \text{Max}\left( \frac{\text{joules}}{\text{second}} \right) \times \frac{\text{inferences}}{\text{joule}}. \tag{3.5}
$$


Therefore, if we can improve energy efficiency by increasing the number of inferences per joule, 
we can increase the number of inferences per second and thus throughput of the system. 

There are several factors that affect the energy efficiency. The number of inferences per 
joule can be decomposed into 





$$
\frac{\text{inferences}}{\text{joule}} = \frac{\text{operations}}{\text{joule}} \times \frac{1}{\dfrac{\text{operations}}{\text{inference}}}, \tag{3.6}
$$


where the number of operations per joule is dictated by both the hardware and DNN model, 
while the number of operations per inference is dictated by the DNN model. 
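
Equations (3.5) and (3.6) can be combined to see how a power ceiling caps throughput. The following Python sketch uses hypothetical values for operations per joule, operations per inference, and the power ceiling; the function names are illustrative.

```python
def inferences_per_joule(ops_per_joule: float, ops_per_inference: float) -> float:
    """Equation (3.6): operations per joule divided by operations per inference."""
    return ops_per_joule / ops_per_inference


def max_inferences_per_second(max_power_watts: float, inf_per_joule: float) -> float:
    """Equation (3.5): upper bound on throughput under a power ceiling."""
    return max_power_watts * inf_per_joule


# Hypothetical: 1e12 operations per joule, 1e9 operations per inference,
# and a 2 W power ceiling (2 joules per second).
ipj = inferences_per_joule(1e12, 1e9)
print(ipj, max_inferences_per_second(2.0, ipj))
```

With these assumed values the design delivers 1000 inferences per joule, so a 2 W ceiling bounds throughput at 2000 inferences per second regardless of how many PEs are available.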

There are various design considerations for the hardware that will affect the energy per 
operation (i.e., joules per operation). The energy per operation can be broken down into the 
energy required to move the input and output data, and the energy required to perform the 
MAC computation 


$$
\text{Energy}_{\text{total}} = \text{Energy}_{\text{data}} + \text{Energy}_{\text{MAC}}. \tag{3.7}
$$

For each component, the joules per operation⁶ is computed as 

$$
\frac{\text{joules}}{\text{operation}} = \alpha \times C \times V_{DD}^{2}, \tag{3.8}
$$

where $C$ is the total switching capacitance, $V_{DD}$ is the supply voltage, and $\alpha$ is the switching 
activity, which indicates how often the capacitance is charged. 
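
A small numerical sketch of Equations (3.7) and (3.8) follows. The switching activity, capacitances, and voltages are hypothetical round numbers picked for illustration, not measurements of any process technology.

```python
def joules_per_operation(switching_activity: float, capacitance_farads: float,
                         supply_voltage_volts: float) -> float:
    """Equation (3.8): alpha x C x V_DD squared."""
    return switching_activity * capacitance_farads * supply_voltage_volts ** 2


def total_energy(energy_data: float, energy_mac: float) -> float:
    """Equation (3.7): data movement energy plus MAC computation energy."""
    return energy_data + energy_mac


# Hypothetical values: switching activity 0.5, 1 pF switched for the MAC,
# 20 pF switched for data movement, evaluated at two supply voltages.
for vdd in (1.0, 0.8):
    e_mac = joules_per_operation(0.5, 1e-12, vdd)
    e_data = joules_per_operation(0.5, 20e-12, vdd)
    print(vdd, e_mac, e_data, total_energy(e_data, e_mac))
```

Because the voltage term is squared, lowering the supply voltage reduces both components quadratically, while the ratio between data movement and MAC energy is set by the capacitance ratio.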

The energy consumption is dominated by the data movement as the capacitance of data 
movement tends to be much higher than the capacitance for arithmetic operations such as a 


6Here, an operation can be a MAC operation or a data movement. 

Figure 3.3: The energy consumption for various arithmetic operations and memory accesses in 
a 45 nm process. The relative energy cost (computed relative to the 8b add) is shown on a log 
scale. The energy consumption of data movement (red) is significantly higher than arithmetic 
operations (blue). (Figure adapted from [121].) 


MAC (Figure 3.3). Furthermore, the switching capacitance increases the further the data needs 
to travel to reach the PE, which consists of the distance to get out of the memory where the 
data is stored and the distance to cross the network between the memory and the PE. Accord- 
ingly, larger memories and longer interconnects (e.g., off-chip) tend to consume more energy 
than smaller and closer memories due to the capacitance of the long wires employed. In or- 
der to reduce the energy consumption of data movement, we can exploit data reuse where the 
data is moved once from distant large memory (e.g., off-chip DRAM) and reused for multiple 
operations from a local smaller memory (e.g., on-chip buffer or scratchpad within the PE). Op- 
timizing data movement is a major consideration in the design of DNN accelerators; the design 
of the dataflow, which defines the processing order, to increase data reuse within the memory 
hierarchy is discussed in Chapter 5. In addition, advanced device and memory technologies can 
be used to reduce the switching capacitance between compute and memory, as described in 
Chapter 10. 
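
The benefit of data reuse can be sketched with a simple amortization model. The Python example below is a simplification under stated assumptions: it tracks a single operand value that is fetched once from a distant memory and then read from a local memory for each MAC, and the relative energy costs (200 for the distant access, 1 for the local access, 1 for the MAC) are hypothetical units rather than values taken from Figure 3.3.

```python
def energy_per_mac(distant_access_energy: float, local_access_energy: float,
                   mac_energy: float, reuse: int) -> float:
    """Average energy per MAC when one distant fetch serves `reuse` MACs.

    Assumes (hypothetically) one distant-memory access amortized over `reuse`
    operations, plus one local-memory access and one MAC per operation.
    """
    return distant_access_energy / reuse + local_access_energy + mac_energy


# Hypothetical relative costs: distant access = 200, local access = 1, MAC = 1.
for reuse in (1, 10, 100):
    print(reuse, energy_per_mac(200.0, 1.0, 1.0, reuse))
```

Under these assumptions, energy per MAC falls from 202 units with no reuse to 4 units when each fetched value is reused 100 times, which is why the dataflow and memory hierarchy are designed to maximize reuse from local storage.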

This raises the issue of the appropriate scope over which energy efficiency and power consumption 
should be reported. Including the entire system (out to the fans and power supplies) is 
beyond the scope of this book. Conversely, ignoring off-chip memory accesses, which can vary 
greatly between chip designs, can easily result in a misleading perception of the efficiency of the 
system. Therefore, it is critical to not only report the energy efficiency and power consumption 
of the chip, but also the energy efficiency and power consumption of the off-chip memory (e.g., 
DRAM) or the amount of off-chip accesses (e.g., DRAM accesses) if no specific memory tech- 
nology is specified; for the latter, it can be reported in terms of the total amount of data that is 
read and written off-chip per inference. 

Reducing the joules per MAC operation itself can be achieved by reducing the switching 
activity and/or capacitance at a circuit level or micro-architecture level. This can also be achieved 
by reducing precision (e.g., reducing the bit width of the MAC operation), as shown in Figure 3.3 
and discussed in Chapter 7. Note that the impact of reducing precision on accuracy must also 
be considered. 

For instruction-based systems such as CPUs and GPUs, this can also be achieved 
by reducing instruction bookkeeping overhead. For example, using large aggregate instruc- 
tions (e.g., single-instruction, multiple-data (SIMD)/Vector Instructions; single-instruction, 
multiple-threads (SIMT)/Tensor Instructions), a single instruction can be used to initiate mul- 
tiple operations. 

Similar to the throughput metric discussed in Section 3.2, the number of operations per 
inference depends on the DNN model; however, the operations per joule may be a function of 
the ability of the hardware to exploit sparsity to avoid performing ineffectual MAC operations. 
Equation (3.9) shows how operations per joule can be decomposed into: 


1. the number of effectual operations plus unexploited ineffectual operations per joule, which re- 
mains somewhat constant for a given hardware architecture design; 


2. the ratio of effectual operations over effectual operations plus unexploited ineffectual operations, 
which refers to the ability of the hardware to exploit ineffectual operations (ideally unex- 
ploited ineffectual operations should be zero, and this ratio should be one); and 


3. the number of effectual operations out of (total) operations, which is related to the amount of 
sparsity and depends on the DNN model. 


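$$
\begin{aligned}
\frac{\text{operations}}{\text{joule}} ={}& \frac{\text{effectual operations} + \text{unexploited ineffectual operations}}{\text{joule}} \\
&\times \frac{\text{effectual operations}}{\text{effectual operations} + \text{unexploited ineffectual operations}} \times \frac{1}{\dfrac{\text{effectual operations}}{\text{operations}}}.
\end{aligned} \tag{3.9}
$$
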
For hardware that can exploit sparsity, increasing the amount of sparsity (i.e., decreasing 
the number of effectual operations out of (total) operations) can increase the number of operations 
per joule, which subsequently increases inferences per joule, as shown in Equation (3.6). While 
exploiting sparsity has the potential of increasing the number of (total) operations per joule, the 
additional hardware will decrease the effectual operations plus unexploited ineffectual operations per 
joule. In order to achieve a net benefit, the decrease in effectual operations plus unexploited ineffectual 
operations per joule must be more than offset by the decrease of effectual operations out of 
(total) operations. 
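
The net-benefit condition can be tested numerically with Equation (3.9). In the Python sketch below, the dense baseline skips no ineffectual operations, the sparse design is assumed to skip all of them, and `efficiency_factor` is a hypothetical parameter describing how much the extra hardware lowers the performed operations per joule; none of these values come from the text.

```python
def operations_per_joule(performed_ops_per_joule: float, effectual_ops: float,
                         unexploited_ineffectual_ops: float, total_ops: float) -> float:
    """Equation (3.9): (total) operations per joule."""
    exploit_ratio = effectual_ops / (effectual_ops + unexploited_ineffectual_ops)
    effectual_fraction = effectual_ops / total_ops
    return performed_ops_per_joule * exploit_ratio / effectual_fraction


def sparsity_gain(effectual_ops: float, total_ops: float,
                  efficiency_factor: float) -> float:
    """Operations per joule of a sparse design relative to a dense baseline.

    Assumes (hypothetically) that the sparse design skips every ineffectual
    operation, that its performed operations per joule are scaled by
    `efficiency_factor` (0 < factor <= 1), and that the baseline skips nothing.
    """
    baseline = operations_per_joule(1.0, effectual_ops,
                                    total_ops - effectual_ops, total_ops)
    sparse = operations_per_joule(efficiency_factor, effectual_ops, 0.0, total_ops)
    return sparse / baseline


print(sparsity_gain(400, 1000, 0.8))  # 40% of operations are effectual
print(sparsity_gain(900, 1000, 0.8))  # 90% of operations are effectual
```

With a 20% efficiency penalty, the sparse design comes out ahead when only 40% of operations are effectual but falls behind the dense baseline when 90% are effectual, which is the break-even behavior described above.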

In summary, we want to emphasize that the number of MAC operations and weights in 
the DNN model are not sufficient for evaluating energy efficiency. From an energy perspective, 
not all MAC operations or weights are created equal. This is because the number of MAC 
operations and weights do not reflect where the data is accessed and how much the data is 
reused, both of which have a significant impact on the operations per joule. Therefore, the number 
of MAC operations and weights is not necessarily a good proxy for energy consumption and it 
is often more effective to design efficient DNN models with hardware in the loop. Techniques 
for designing DNN models with hardware in the loop are discussed in Chapter 9. 

In order to evaluate the energy efficiency and power consumption of the entire system, it is 
critical to not only report the energy efficiency and power consumption of the chip, but also the 
energy efficiency and power consumption of the off-chip memory (e.g., DRAM) or the amount 
of off-chip accesses (e.g., DRAM accesses) if no specific memory technology is specified; for the 
latter, it can be reported in terms of the total amount of data that is read and written off-chip 
per inference. As with throughput and latency, the evaluation should be performed on clearly 
specified, ideally widely used, DNN models. 


3.4 HARDWARE COST 


In order to evaluate the desirability of a given architecture or technique, it is also important to 
consider the hardware cost of the design. Hardware cost is used to indicate the monetary cost to 
build a system.⁷ This is important from both an industry and a research perspective to dictate 
whether a system is financially viable. From an industry perspective, the cost constraints are 
related to volume and market; for instance, embedded processors have much more stringent 
cost limitations than processors in the cloud. 

One of the key factors that affect cost is the chip area (e.g., square millimeters, mm²) in 
conjunction with the process technology (e.g., 45 nm CMOS), which constrains the amount of 
on-chip storage and amount of compute (e.g., the number of PEs for custom DNN accelera- 
tors, the number of cores for CPUs and GPUs, the number of digital signal processing (DSP) 
engines for FPGAs, etc.). To report information related to area, without specifying a specific 


7 There is also cost associated with operating a system, such as the electricity bill and the cooling cost, which are primarily 
dictated by the energy efficiency and power consumption, respectively. There is also cost associated with designing the system. 
The operating cost is covered by the section on energy efficiency and power consumption, and we limit our coverage of design 
cost to the fact that custom DNN accelerators have a higher design cost than off-the-shelf CPUs and GPUs. We consider 
anything beyond this, e.g., the economics of the semiconductor business, including how to price platforms, to be outside the 
scope of this book. 
process technology, the amount of on-chip memory (e.g., storage capacity of the global buffer) 
and compute (e.g., number of PEs) can be used as a proxy for area. 

Another important factor is the amount of off-chip bandwidth, which dictates the cost 
and complexity of the packaging and printed circuit board (PCB) design (e.g., High Bandwidth 
Memory (HBM) [122] to connect to off-chip DRAM, NVLink to connect to other GPUs, etc.), 
as well as whether additional chip area is required for a transceiver to handle signal integrity at 
high speeds. The off-chip bandwidth, which is typically reported in gigabits per second (Gbps), 
sometimes including the number of I/O ports, can be used as a proxy for packaging and PCB 
cost. 

There is also an interplay between the costs attributable to the chip area and off-chip 
bandwidth. For instance, increasing on-chip storage, which increases chip area, can reduce off- 
chip bandwidth. Accordingly, both metrics should be reported in order to provide perspective 
on the total cost of the system. 

Of course, reducing cost alone is not the only objective. The design objective is invariably to 
maximize the throughput or energy efficiency for a given cost, specifically, to maximize inferences 
per second per cost (e.g., $) and/or inferences per joule per cost. This is closely related to the previously 
discussed property of utilization; to be cost efficient, the design should aim to utilize every PE 
to increase inferences per second, since each PE increases the area and thus the cost of the chip; 
similarly, the design should aim to effectively utilize all the on-chip storage to reduce off-chip 
bandwidth, or increase operations per off-chip memory access as expressed by the roofline model 
(see Figure 3.1), as each byte of on-chip memory also increases cost. 
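
The link between utilization and cost efficiency can be sketched as follows. This Python example compares two hypothetical designs; the PE counts, utilizations, per-PE throughput, and cost units are all assumptions made for illustration, with cost treated as a single abstract number.

```python
def inferences_per_second_per_cost(num_pes: int, pe_utilization: float,
                                   ops_per_second_per_pe: float,
                                   ops_per_inference: float, cost: float) -> float:
    """Throughput per unit cost, building on Equations (3.1) and (3.2)."""
    ops_per_second = num_pes * pe_utilization * ops_per_second_per_pe
    return (ops_per_second / ops_per_inference) / cost


# Hypothetical designs sharing a 200 MHz, 1-MAC-per-cycle PE and a model that
# needs 700 million MACs per inference.
design_a = inferences_per_second_per_cost(256, 0.5, 200e6, 700e6, cost=2.0)
design_b = inferences_per_second_per_cost(128, 0.9, 200e6, 700e6, cost=1.0)
print(design_a, design_b)
```

In this example the smaller design with well-utilized PEs delivers more inferences per second per unit cost than the larger design whose PEs sit idle half the time.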


3.5 FLEXIBILITY 


The merit of a DNN accelerator is also a function of its flexibility. Flexibility refers to the range of 
DNN models that can be supported on the DNN processor and the ability of the software envi- 
ronment (e.g., the mapper) to maximally exploit the capabilities of the hardware for any desired 
DNN model. Given the fast-moving pace of DNN research and deployment, it is increasingly 
important that DNN processors support a wide range of DNN models and tasks. 

We can define support in two tiers: the first tier requires that the hardware only needs 
to be able to functionally support different DNN models (i.e., the DNN model can run on the 
hardware). The second tier requires that the hardware should also maintain efficiency (i.e., high 
throughput and energy efficiency) across different DNN models. 

To maintain efficiency, the hardware should not rely on certain properties of the DNN 
models to achieve efficiency, as the properties cannot be guaranteed. For instance, a DNN ac- 
celerator that can efficiently support the case where the entire DNN model (i.e., all the weights) 
fits on-chip may perform extremely poorly when the DNN model grows larger, which is likely 
given that the size of DNN models continues to increase over time, as discussed in Section 2.4.1; 
a more flexible processor would be able to efficiently handle a wide range of DNN models, even 
those that exceed on-chip memory. 
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The degree of flexibility provided by a DNN accelerator is a complex trade-off with accel- 
erator cost. Specifically, additional hardware usually needs to be added in order to flexibly sup- 
port a wider range of workloads and/or improve their throughput and energy efficiency. We all 
know that specialization improves efficiency; thus, the design objective is to reduce the overhead 
(e.g., area cost and energy consumption) of supporting flexibility while maintaining efficiency 
across the wide range of DNN models. Thus, evaluating flexibility would entail ensuring that 
the extra hardware is a net benefit across multiple workloads. 

Flexibility has become increasingly important when we factor in the many techniques that 
are being applied to DNN models with the promise of making them more efficient, since they 
increase the diversity of workloads that need to be supported. These techniques include DNNs 
with different network architectures (i.e., different layer shapes, which impact the amount of 
required storage and compute and the available data reuse that can be exploited), as described 
in Chapter 9; different levels of precision (i.e., different numbers of bits across layers and data 
types), as described in Chapter 7; and different degrees of sparsity (i.e., number of zeros in the 
data), as described in Chapter 8. There are also different types of DNN layers and computation 
beyond MAC operations (e.g., activation functions) that need to be supported. 

Actually getting a performance or efficiency benefit from these techniques invariably re- 
quires additional hardware, because a simpler DNN accelerator design may not benefit from 
these techniques. Again, it is important that the overhead of the additional hardware does not 
exceed the benefits of these techniques. This encourages a hardware and DNN model co-design 
approach. 

To date, exploiting the flexibility of DNN hardware has relied on mapping processes that 
act like static per-layer compilers. As the field moves to DNN models that change dynamically, 
mapping processes will need to dynamically adapt at runtime to changes in the DNN model or 
input data, while still maximally exploiting the flexibility of the hardware to improve efficiency. 

In summary, to assess the flexibility of a DNN processor, its efficiency (e.g., inferences 
per second, inferences per joule) should be evaluated on a wide range of DNN models. The 
MLPerf benchmarking workloads are a good start; however, additional workloads may be needed 
to represent efficient techniques such as efficient network architectures, reduced precision, and 
sparsity. The workloads should match the desired application. Ideally, since there can be many 
possible combinations, it would also be beneficial to define the range and limits of DNN models 
that can be efficiently supported on a given platform (e.g., maximum number of weights per 
filter or DNN model, minimum amount of sparsity, required structure of the sparsity, levels of 
precision such as 8-bit, 4-bit, 2-bit, or 1-bit, types of layers and activation functions, etc.). 


3.6 SCALABILITY 


Scalability has become increasingly important due to the wide range of use cases for DNNs and emerging 
technologies used for scaling up not just the size of the chip, but also building systems with 
multiple chips (often referred to as chiplets) [123] or even wafer-scale chips [124]. Scalability refers 
to how well a design can be scaled up to achieve higher throughput and energy efficiency when 
increasing the amount of resources (e.g., the number of PEs and on-chip storage). This evalua- 
tion is done under the assumption that the system does not have to be significantly redesigned 
(e.g., the design only needs to be replicated) since major design changes can be expensive in 
terms of time and cost. Ideally, a scalable design can be used for low-cost embedded devices and 
high-performance devices in the cloud simply by scaling up the resources. 

Ideally, the throughput would scale linearly and proportionally with the number of PEs. 
Similarly, the energy efficiency would also improve with more on-chip storage; however, this 
would likely be nonlinear (e.g., increasing the on-chip storage such that the entire DNN 
model fits on chip would result in an abrupt improvement in energy efficiency). In practice, achieving 
such scaling is often challenging due to factors such as the reduced utilization of PEs and the increased cost 
of data movement due to long-distance interconnects. 

Scalability can be connected with cost efficiency by considering how inferences per second 
per cost (e.g., $) and inferences per joule per cost change with scale. For instance, if throughput 
increases linearly with the number of PEs, then the inferences per second per cost would be constant. 
It is also possible for the inferences per second per cost to improve super-linearly with an increasing 
number of PEs, due to increased sharing of data across PEs. 
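
The relationship between scale and cost efficiency can be made concrete with a small calculation. The sketch below is only illustrative: the function name, the per-PE throughput and cost figures, and the assumption that cost grows in proportion to the number of PEs are all hypothetical and are not taken from any real design.

```python
def inferences_per_second_per_cost(num_pes, throughput_per_pe, cost_per_pe,
                                   utilization=1.0):
    """Throughput divided by cost for a design with num_pes PEs.

    Illustrative assumptions: cost is proportional to the number of PEs, and
    throughput is num_pes * throughput_per_pe * utilization.
    """
    throughput = num_pes * throughput_per_pe * utilization
    cost = num_pes * cost_per_pe
    return throughput / cost


# Linear scaling: full utilization at both sizes, so the ratio is unchanged.
small = inferences_per_second_per_cost(64, throughput_per_pe=2.0, cost_per_pe=0.5)
large = inferences_per_second_per_cost(1024, throughput_per_pe=2.0, cost_per_pe=0.5)

# Sub-linear scaling: utilization drops at the larger size, so the ratio falls.
large_underused = inferences_per_second_per_cost(
    1024, throughput_per_pe=2.0, cost_per_pe=0.5, utilization=0.75)
```

Under these assumptions, the ratio stays constant when throughput scales linearly and falls as soon as PE utilization drops at the larger scale.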

In summary, to understand the scalability of a DNN accelerator design, it is important 
to report its performance and efficiency metrics as the number of PEs and storage capacity 
increases. This may include how well the design might handle technologies used for scaling up, 
such as inter-chip interconnect. 


3.7 INTERPLAY BETWEEN DIFFERENT METRICS 


It is important that all metrics are accounted for in order to fairly evaluate all the design trade- 
offs. For instance, without the accuracy given for a specific dataset and task, one could run a 
simple DNN and easily claim low power, high throughput, and low cost—however, the pro- 
cessor might not be usable for a meaningful task; alternatively, without reporting the off-chip 
bandwidth, one could build a processor with only multipliers and easily claim low cost, high 
throughput, high accuracy, and low chip power—however, when evaluating system power, the 
off-chip memory access would be substantial. Finally, the test setup should also be reported, 
including whether the results are measured or obtained from simulation⁸ and how many images 
were tested. 

In summary, the evaluation process for whether a DNN system is a viable solution for a 
given application might go as follows: 


1. the accuracy determines if it can perform the given task; 


2. the latency and throughput determine if it can run fast enough and in real time; 


3. the energy and power consumption will primarily dictate the form factor of the device 
where the processing can operate; 


4. the cost, which is primarily dictated by the chip area and external memory bandwidth 
requirements, determines how much one would pay for this solution; 


5. flexibility determines the range of tasks it can support; and 


6. the scalability determines whether the same design effort can be amortized for deployment 
in multiple domains (e.g., in the cloud and at the edge), and if the system can efficiently 
be scaled with DNN model size. 


8If obtained from simulation, it should be clarified whether it is from synthesis or post place-and-route and what library 
corner (e.g., process corner, supply voltage, temperature) was used. 


CHAPTER 4 


Kernel Computation 


The fundamental computations of both CONV and FC layers described in Chapter 2 are 
multiply-and-accumulate (MAC) operations. Because there are negligible dependencies between 
these operations and the accumulations are commutative, there is considerable flexibility 
in the order in which MACs can be scheduled and these computations can be easily parallelized. 
Therefore, in order to achieve high performance for DNNs, highly parallel compute paradigms 
are very commonly used. These architectural paradigms can be categorized as being either 
temporal or spatial, as shown in Figure 4.1. 

Temporal architectures use centralized control for a large number of arithmetic logic 
units (ALUs). These ALUs typically can only fetch data from the memory hierarchy and can- 
not communicate directly with each other. Such architectures, which appear mostly in CPUs 
or GPUs, employ a variety of techniques to improve parallelism such as vector instructions 
(e.g., single-instruction-multiple-data, SIMD, instructions) or parallel threads (e.g., single- 
instruction-multiple-thread, SIMT, architectures). In contrast, spatial architectures allow for 
communication between ALUs, and use dataflow processing (i.e., the ALUs form a processing 
chain so that they can pass data from one to another directly). Sometimes each ALU can have 
its own control logic and local memory, called a scratchpad or register file. We refer to an ALU 
with its own local memory as a processing engine (PE). Spatial architectures are commonly used 
for processing DNNs in ASIC- and FPGA-based designs. 

With the rise in popularity of DNNs, many programmable temporal systems (i.e., CPUs 
and GPUs) started adding features that target DNN processing. For instance, the Intel Knights 
Landing CPU featured special vector instructions for deep learning that performed multiple 
fused multiply accumulate operations; the Nvidia PASCAL GP100 GPU featured 16-bit float- 
ing point (fp16) arithmetic support to perform two fp16 operations on a single precision core 
for faster deep learning computation. As will be described in Section 4.1, DNN calculations can 
often be cast as matrix multiplications. As a result, the Nvidia VOLTA GV100 GPU featured 
a special compute unit for performing matrix multiplication and accumulation. Activity on that 
unit is invoked with individual instructions that perform many MAC operations. 

We have also been seeing systems built specifically for DNN processing, such as Facebook's 
Big Basin custom DNN server [125] and Nvidia's DGX-1. DNN inference also started 
appearing on various embedded System-on-Chips (SoC¹) such as Apple’s A, Nvidia’s Tegra, 
and Samsung’s Exynos. 


Figure 4.1: Highly parallel compute paradigms. 

In this chapter and Chapter 5, we will discuss the different design strategies for efficient 
processing on these different platforms, without any impact on accuracy (i.e., all approaches in 
this chapter produce bit-wise identical results²); specifically, 


• for temporal architectures such as CPUs and GPUs, we will discuss how DNN algorithms 
can be mapped and optimized on these platforms, how computational transforms on the 
kernel can reduce the number of multiplications to increase throughput, and how the 
computation (e.g., MACs) can be ordered (i.e., tiled) to improve memory subsystem behavior 
(this chapter); and 


• for spatial architectures used in accelerators, we will discuss how dataflows can increase 
data reuse from low-cost memories in the memory hierarchy to reduce energy consumption 
and how other architectural features can help optimize data movement (Chapter 5). 


1A system-on-chip (SoC) refers to when a CPU and accelerators such as GPUs or application specific processing modules 
for video compression engines and baseband communications, are all integrated on a chip. 

2There may be some minor mismatch due to order of operations for floating point operations; however, this does not 
affect accuracy. 


4.1 MATRIX MULTIPLICATION WITH TOEPLITZ 


CPUs and GPUs use hardware parallelization techniques such as SIMD or SIMT to perform 
the MACs in parallel. All the ALUs share the same control and memory (register file). Such 
a scheme is naturally amenable to performing the many regular parallel multiplications found in 
matrix-matrix or matrix-vector multiplication.³ Therefore, the kernel computations of both the 
FC and CONV layers are often mapped to matrix multiplication. Figure 4.2 shows how a matrix 
multiplication is used for the FC layer. The height of the filter matrix is the number of 3-D filters 
(M) and the width is the number of weights per 3-D filter (input channels (C) x height (H) 
x width (W), since the filter height (R) equals H and the filter width (S) equals W in the FC 
layer); the height of the input feature maps matrix is the number of activations per 3-D input 
feature map (C x H x W), and the width is the number of 3-D input feature maps, also referred 
to as the batch size (one in Figure 4.2a and N in Figure 4.2b); finally, the height of the output 
feature map matrix is the number of channels in the output feature maps (M), and the width is 
the number of 3-D output feature maps/batch size (N), where each output feature map of the 
FC layer has the dimension of 1 x 1 x number of output channels (M). 
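
The matrix shapes described above can be illustrated with a minimal sketch in plain Python. The helper names (`matmul`, `flatten`, `fc_layer`) and the toy sizes are assumptions chosen for illustration; an optimized library would of course not use nested lists.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][c] for k in range(inner)) for c in range(cols)]
            for row in a]


def flatten(volume):
    """Flatten one C x H x W volume into a vector of length C*H*W."""
    return [v for channel in volume for row in channel for v in row]


def fc_layer(filters, fmaps):
    """filters: M filters, each C x H x W; fmaps: N input fmaps, each C x H x W.

    Returns the M x N output matrix (one column per element of the batch).
    """
    filter_matrix = [flatten(f) for f in filters]            # M x CHW
    columns = [flatten(x) for x in fmaps]                    # N vectors of CHW
    input_matrix = [list(row) for row in zip(*columns)]      # CHW x N
    return matmul(filter_matrix, input_matrix)               # M x N


# Toy sizes (assumed): M = 2 filters, C = 1, H = W = 2, batch size N = 2.
filters = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]
fmaps = [[[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]]
fc_out = fc_layer(filters, fmaps)
```

With a batch size of one, `input_matrix` has a single column and the same code performs a matrix-vector multiplication.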

The CONV layer in a DNN can also be mapped to a matrix multiplication using a relaxed 
form of the Toeplitz matrix, as shown in Figure 4.3. In this form, the input activations in the 
input feature map are replicated to correspond to the input activation convolutional reuse. The 
downside of using matrix multiplication for the CONV layers is that there is redundant data in 
the input feature map matrix, as highlighted in Figure 4.3a. This can lead to either inefficiency 
in storage, or a complex memory access pattern. 
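
As a rough illustration of the input replication, the sketch below builds the relaxed Toeplitz matrix for a small input and performs the convolution as a matrix multiplication. The function names are hypothetical, the layout (filters as M x CRS rows, Toeplitz matrix as CRS x PQ) is one reasonable choice rather than the only one, and U denotes the stride as in the text.

```python
def toeplitz_matrix(fmap, R, S, U=1):
    """Expand a C x H x W input fmap into a (C*R*S) x (P*Q) matrix.

    Each column holds the activations under one filter position, so
    activations shared by overlapping windows are stored more than once.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    P, Q = (H - R + U) // U, (W - S + U) // U
    columns = []
    for p in range(P):
        for q in range(Q):
            columns.append([fmap[c][p * U + r][q * U + s]
                            for c in range(C) for r in range(R) for s in range(S)])
    return [list(row) for row in zip(*columns)]


def conv_via_matmul(filters, fmap, U=1):
    """filters: M filters, each C x R x S. Returns an M x (P*Q) output matrix."""
    R, S = len(filters[0][0]), len(filters[0][0][0])
    filter_matrix = [[w for ch in f for row in ch for w in row] for f in filters]
    toeplitz = toeplitz_matrix(fmap, R, S, U)
    return [[sum(w * t for w, t in zip(frow, col)) for col in zip(*toeplitz)]
            for frow in filter_matrix]


fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]    # C = 1, H = W = 3
filters = [[[[1, 2], [3, 4]]]]                # M = 1, C = 1, R = S = 2
toeplitz = toeplitz_matrix(fmap, R=2, S=2)
conv_out = conv_via_matmul(filters, fmap)
```

In this toy case the 9 input activations expand into 16 matrix entries, which is the redundancy the text refers to.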

Since convolving an input by a filter is mathematically equivalent to convolving the filter by 
the input, one can also convert convolution into a matrix multiply by replicating the filter weights 
to correspond to the filter weight convolutional reuse. Such a transformation is illustrated in 
Figure 4.4. As was the case when replicating input activations, the downside is the redundant 
data in the filter matrix. Furthermore, since the filter size is typically much smaller than the size 
of the input feature map, this transformation results in a sparse matrix. The combination of these 
factors can again lead to either inefficiency in storage or a complex memory access pattern. 
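
The weight-replication alternative can be sketched in the same way. Here each row of the matrix holds one copy of the filter, placed at the positions of the inputs it touches; the function name and the row/column orientation are assumptions made for illustration and may differ from the figure.

```python
def replicated_filter_matrix(filt, H, W, U=1):
    """Build a (P*Q) x (H*W) matrix that applies an R x S filter to a flattened
    H x W input. Entries that the filter does not touch stay zero."""
    R, S = len(filt), len(filt[0])
    P, Q = (H - R + U) // U, (W - S + U) // U
    rows = []
    for p in range(P):
        for q in range(Q):
            row = [0] * (H * W)
            for r in range(R):
                for s in range(S):
                    row[(p * U + r) * W + (q * U + s)] = filt[r][s]
            rows.append(row)
    return rows


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # single channel, H = W = 3
filt = [[1, 2], [3, 4]]                       # R = S = 2
weight_matrix = replicated_filter_matrix(filt, H=3, W=3)
flat_input = [v for row in fmap for v in row]
conv_out = [sum(w * x for w, x in zip(row, flat_input)) for row in weight_matrix]
num_zeros = sum(row.count(0) for row in weight_matrix)
```

Even for this tiny example, 20 of the 36 matrix entries are zero, and the fraction of zeros grows as the input becomes larger relative to the filter.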


4.2 TILING FOR OPTIMIZING PERFORMANCE 


As described in the previous section, many DNN computations can be formulated as a matrix 
multiplication. To efficiently perform these computations, there are many software libraries 
designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) 
that have optimized implementations of matrix multiplication. A key characteristic of these 
libraries is that they strive to optimize memory subsystem behavior. Specifically, given the 
conventional memory subsystem organization that places increasingly smaller, faster, and lower-energy 
memories closer to the compute units, the libraries attempt to maximize reuse of the 
values held in the smaller, faster, and more energy-efficient memories. 


3These operations are often implemented in generalized matrix-matrix multiplication (GEMM) or generalized matrix- 
vector multiplication (GEMV) libraries. 


(a) Matrix-vector multiplication is used when computing a single output feature map from 
a single input feature map. 

(b) Matrix-matrix multiplication is used when computing N output feature maps from N input feature maps. 


Figure 4.2: Mapping to matrix multiplication for FC layers (where filter size is equal to input 
feature map size, such that R = H and S = W); in other words, CHW in the figure is the 
same as CRS. A batch size of N = 1 results in a matrix-vector multiplication, while a batch size 
greater than 1 (N > 1) results in a matrix-matrix multiplication. 


To understand how one might maximize reuse of values in the memories closest to the 
compute units, consider a naive implementation of matrix multiplication used in a fully con- 
nected computation, as illustrated in Figure 4.5. The figure shows how rows of the filter weight 
matrix are combined with columns of the input feature map. This combination involves the 
element-wise multiplication of the values from the filter row and the input feature map column 
and final summing of all the elements of the resulting vector (i.e., doing an inner product). Each 
such inner product produces the value for one element of the output feature map matrix. This 
computation style is referred to as the inner-product approach to matrix multiplication. 


(a) Mapping convolution to matrix multiplication. Convolution is performed via matrix multiplication 
after flattening the filter weights and output fmap into vectors and expanding the input fmap values 
through replication into the Toeplitz matrix. 

(b) Extend Toeplitz matrix to multiple input and output channels. Additional rows and columns 
corresponding to additional input and output channels are added to the filter weights, output fmap, and 
Toeplitz matrix to perform multiple input channel/multiple output channel convolution via matrix 
multiplication. 

(c) Dimensions of matrix multiplication, where $P = (H - R + U)/U$ and $Q = (W - S + U)/U$, as defined in 
Equation (2.1) and Table 2.1. 


Figure 4.3: Mapping to matrix multiplication for CONV layers. 


Figure 4.4: Mapping to matrix multiplication for CONV layers by weight replication. 


Figure 4.5: Illustration of dot product operation on a row of the filter weight matrix and column 
of the input feature map (fmap). 


If the inner-product computation is ordered such that a single row of the filter matrix is 
combined successively with each of the columns of the input feature map, then there is apparently 
good reuse of the elements of the row of the filter matrix. However, often the size of a row 
in the filter matrix⁴ is larger than the memory closest to the compute units, e.g., cache. This 
results in poor reuse, because values from the filter matrix in the small memory cannot be held 
long enough that they are available to be reused by the computation on the next column of the 
input feature map matrix. Therefore, they must be reloaded from the next level of the memory 
hierarchy resulting in significant inefficiency. 

To ameliorate the memory inefficiencies that result from calculating the inner products on 
full rows and columns of a matrix multiply, libraries will invariably partition or tile the 
computation to fit in the various levels of the memory hierarchy. The principle behind tiling is illustrated 
in Figure 4.6, where the inner products are done on a 2-D partition, or tile, of the full matrices. 
For each pair of tiles in the matrices, the same inner-product approach can be employed on the 
partial rows of filter weights and partial columns in the input feature map matrix to create a tile 
of partial results in the output feature map. As computations for all the pairs of tiles are done, 
the subsequent partial results are added to the partial results in the output feature map from 
previous partial result computations. If a single tile of filter weights is used repeatedly to create 
a series of partial results, and if the tile is small enough to be held in the memory closest to the 
compute units, then reuse in that memory will be higher. 
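
A minimal sketch of this idea, assuming square tiles and plain Python lists, is shown below. The function names and the loop order (one tile of filter weights held while sweeping across tiles of the input feature map matrix) are illustrative choices and not a description of how any particular library is implemented.

```python
def matmul_naive(a, b):
    """Inner-product matrix multiply on full rows and columns."""
    k, n = len(b), len(b[0])
    return [[sum(row[kk] * b[kk][j] for kk in range(k)) for j in range(n)]
            for row in a]


def matmul_tiled(a, b, tile):
    """Same result, computed tile by tile with partial-sum accumulation."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for k0 in range(0, k, tile):
            # The filter tile a[i0:i0+tile][k0:k0+tile] is reused for every
            # input tile visited by the j0 loop below.
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]          # previous partial result
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out


weights = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]            # 3 x 4
inputs = [[1, 0, 2, 1, 0], [0, 1, 1, 0, 2], [3, 1, 0, 1, 1], [2, 2, 1, 0, 1]]
tiled = matmul_tiled(weights, inputs, tile=2)
```

The tile size here is a free parameter; in practice it would be chosen so that a tile fits in the memory level being targeted.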

Tiling of matrix multiply can be applied recursively to improve efficiency at each level of 
the memory hierarchy. Tiling can also be applied to parallelize the computation across multiple 
CPUs or the many threads of a GPU. Therefore, CPU and GPU libraries for matrix multiplication 
have been optimized by tiling the computations appropriately for every architecture of 
interest based on characteristics like sizes of the cache at each level in the memory hierarchy and 
the topology of parallel computation units. 


4Note that each row of the filter matrix represents a filter, and the size of the row is the size of the filter, which for fully 
connected is C x H x W. 


Figure 4.6: Illustration of dot product operation on a tiled row of a filter weight matrix and a 
tiled column of the input feature map (fmap). Note that the output is highlighted with a dotted 
line since only a partial value has been computed (from tiles $F_{0,0}$ and $I_{0,0}$). To complete the 
computation of the output, one needs to perform the dot product operation on tiles $F_{0,1}$ and 
$I_{1,0}$ and accumulate with the previous result. 

Tiling algorithms require considerable sophistication. For instance, an additional com- 
plication arises because set associative caches have a policy to determine which data is retained 
when new data needs to be added (i.e., on a cache miss). These policies are implemented with 
sophisticated hardware-based replacement algorithms (e.g., least recently used (LRU) or dy- 
namic re-reference interval prediction (DRRIP) [126]). Therefore, the libraries also have to try 
to account for exactly how tiles will be retained to achieve optimal performance. To handle the 
wide variety of hardware platforms and layer shapes, these libraries often have different imple- 
mentations of an algorithm, like matrix multiply, that are optimized for a particular hardware 
platform and layer shape. This optimization invariably includes the tiling strategy. So, when the 
library is called, it will dynamically select and run the most appropriate implementation. 

In addition to the pre-existing libraries that dynamically pick from a menu of 
implementations, there has been considerable work on compilers that optimize a user-written program 
for tiling, e.g., [127]. One segment of this work relies on creating a polyhedral model of the 
computation and using a Boolean satisfiability (SAT) solver to optimally tile and schedule the 
program [128]. Such techniques have been added to the popular GCC and LLVM compiler 
frameworks. Halide [129] is an example of another approach that decouples the basic expression 
of the algorithm from user-provided annotations that describe the desired scheduling and tiling 
of the algorithm. Finally, TVM [130] is a compiler that exposes graph-level and operator-level 
optimizations for DNN workloads across diverse hardware back-ends. In addition to tiling for 
hiding memory latency, TVM performs optimizations such as high-level operator fusion (e.g., 
performing a CONV layer and ReLU together with one pass through memory), and mapping 
to arbitrary hardware primitives. 

As will be discussed in Chapter 5, the concept of tiling also applies to specialized DNN 
architectures that perform matrix multiplication as well as those that directly perform convolu- 
tions. In such architectures, tiling is also done to maximize data reuse in the memory hierarchy 
and by parallel computation units. However, the tiling is not done in caches (i.e., implicitly 
data orchestrated units) but rather in more specialized buffers (i.e., explicitly data orchestrated 
units), whose data retention characteristics are more directly under program control, as described 
in Section 5.8. 


4.3 COMPUTATION TRANSFORM OPTIMIZATIONS 


DNN calculations can sometimes be further sped up by applying computational transforms to 
the data to reduce the number of (typically expensive) multiplications, while still giving the same 
bit-wise result. The objective is to improve performance or reduce energy consumption, although 
this can come at a cost of more intermediate results, an increased number of additions, and a 
more irregular data access pattern. 


4.3.1 GAUSS’ COMPLEX MULTIPLICATION TRANSFORM 


One way to view these transforms is as more sophisticated versions of Gauss’ technique for 
multiplying complex numbers. In the standard approach for complex multiplication, the computation 
$(a + bi)(c + di)$ is performed with a full set of cross-term multiplications as follows: 


$$(ac - bd) + (bc + ad)i.$$


In this case, the computation requires 4 multiplications and 3 additions. However, the computation 
can be transformed by a re-association of operations into the computation of three 
intermediate terms, and then the real and imaginary parts of the result are computed via a sum and 
difference of the intermediate terms ($k_j$) as follows: 


$$
\begin{aligned}
k_1 &= c(a + b) \\
k_2 &= a(d - c) \\
k_3 &= b(c + d) \\
\text{Real part} &= k_1 - k_3 \\
\text{Imaginary part} &= k_1 + k_2.
\end{aligned}
$$


In the transformed form, the computation is reduced to 3 multiplications and 5 additions. In the 
following sections, we will discuss more sophisticated re-association of computations for matrix 
multiplication and direct convolution that can result in reducing the number of multiplications 
in those computations. 
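
The re-association can be checked with a few lines of Python. This is only a sketch; the function names are hypothetical, and the complex numbers are passed as separate real and imaginary parts so that the multiplication count is visible.

```python
def complex_mul_standard(a, b, c, d):
    """(a + bi)(c + di) with 4 multiplications."""
    return (a * c - b * d, b * c + a * d)


def complex_mul_gauss(a, b, c, d):
    """(a + bi)(c + di) with 3 multiplications and 5 additions/subtractions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)      # (real part, imaginary part)


standard = complex_mul_standard(1, 2, 3, 4)
gauss = complex_mul_gauss(1, 2, 3, 4)
```

Both functions return the same real and imaginary parts; the second trades one multiplication for extra additions.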
4.3.2 STRASSEN’S MATRIX MULTIPLICATION TRANSFORM 


For matrix multiplication, a well-known re-association technique is Strassen’s algorithm, which 
has been explored for reducing the number of multiplications in DNNs [131]. Strassen’s 
algorithm rearranges the computations of a matrix multiplication in a recursive manner to reduce 
the number of multiplications from $O(N^3)$ to $O(N^{2.807})$. An illustration of the application of 
the Strassen algorithm to a 2x2 matrix multiplication is shown as follows: 


$$
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad
B = \begin{bmatrix} e & f \\ g & h \end{bmatrix}, \quad
AB = \begin{bmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{bmatrix}.
$$


In this example, the 8 multiplications and 4 additions are converted into 7 multiplications and 18 
additions along with the creation of 7 intermediate values ($k_j$) as follows: 


$$
\begin{aligned}
k_1 &= a(f - h) \\
k_2 &= (a + b)h \\
k_3 &= (c + d)e \\
k_4 &= d(g - e) \\
k_5 &= (a + d)(e + h) \\
k_6 &= (b - d)(g + h) \\
k_7 &= (a - c)(e + f).
\end{aligned}
$$


The final output AB is constructed from the intermediate ($k_j$) values as follows: 


$$
AB = \begin{bmatrix} k_5 + k_4 - k_2 + k_6 & k_1 + k_2 \\ k_3 + k_4 & k_1 + k_5 - k_3 - k_7 \end{bmatrix}.
$$


Note that, when used for DNN calculations, one of the matrices will contain filter weights, 
which will be constant across different inputs. In this example, pre-calculation using a constant 
B matrix reduces the number of additions to 13. In summary, Strassen can be used to reduce the 
number of multiplications, but its benefits come at the cost of increased storage requirements for 
intermediate results and sometimes reduced numerical stability [132]. Furthermore, the benefit 
of Strassen is primarily manifest for matrices larger than those typically used in DNNs. 
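
The 2x2 case can be written out directly as a check on the seven intermediate products. The sketch below uses scalar entries and a hypothetical function name; in a recursive implementation the entries a through h would themselves be sub-matrices.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using 7 multiplications instead of 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    k1 = a * (f - h)
    k2 = (a + b) * h
    k3 = (c + d) * e
    k4 = d * (g - e)
    k5 = (a + d) * (e + h)
    k6 = (b - d) * (g + h)
    k7 = (a - c) * (e + f)
    return [[k5 + k4 - k2 + k6, k1 + k2],
            [k3 + k4, k1 + k5 - k3 - k7]]


def matmul_2x2(A, B):
    """Reference 2x2 product with 8 multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


product = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

When B holds constant filter weights, the five terms that involve only e, f, g, and h can be computed once ahead of time, which is where the reduction from 18 to 13 additions comes from.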


4.3.3 WINOGRAD TRANSFORM 


Winograd’s algorithm [133, 134] applies a re-association of the arithmetic operations on the 
feature map and filter to reduce the number of multiplications required specifically for 
convolution, as opposed to a generic matrix multiply handled by the previously described transforms. 
The Winograd transform allows for the efficient computation of multiple convolutions using the 
same filter weights. For example, the computation of two 1x3 convolutions of input activations 
($i_j$) and filter weights ($f_j$) normally takes 6 multiplications and 4 additions as follows: 


$$
\begin{bmatrix} i_0 & i_1 & i_2 \\ i_1 & i_2 & i_3 \end{bmatrix}
\begin{bmatrix} f_0 \\ f_1 \\ f_2 \end{bmatrix}
= \begin{bmatrix} o_0 \\ o_1 \end{bmatrix}.
$$


However, using the Winograd re-association, this reduces to 4 multiplications and 12 additions 
along with 2 shifts (to implement divide by 2) using 4 intermediate values ($k_j$) as follows: 


$$
\begin{aligned}
k_1 &= (i_0 - i_2) f_0 \\
k_2 &= (i_1 + i_2) \frac{f_0 + f_1 + f_2}{2} \\
k_3 &= (i_2 - i_1) \frac{f_0 - f_1 + f_2}{2} \\
k_4 &= (i_1 - i_3) f_2.
\end{aligned}
$$


The final outputs ($o_j$) are constructed from the intermediate ($k_j$) values as follows: 


$$
\begin{bmatrix} o_0 \\ o_1 \end{bmatrix}
= \begin{bmatrix} k_1 + k_2 + k_3 \\ k_2 - k_3 - k_4 \end{bmatrix}.
$$


With constant filter weights, this reduces to 4 multiplications and 8 additions. Note that each 
application of the Winograd transform only does a convolution on a small number of input 
activations (i.e., a tile of input activations). Therefore, to do the entire set of convolutions for 
an input feature map requires the application of Winograd on a tile-by-tile basis; as a result, a 
series of separate tiles of output are generated using a sliding window of input activations (i.e., 
inputs are reused across output tiles). Winograd transforms can apply to 2-D convolutions by 
repeated application of the transform. 
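
The 1-D case and its tile-by-tile application can be sketched as follows. The function names are hypothetical, and the sketch assumes an input whose length yields an even number of outputs so that the tiles cover it exactly.

```python
def winograd_1x3_tile(i, f):
    """Two outputs of a 1x3 filter from four inputs using 4 multiplications."""
    i0, i1, i2, i3 = i
    f0, f1, f2 = f
    # Filter-side terms; with constant weights these can be precomputed.
    fa = (f0 + f1 + f2) / 2
    fb = (f0 - f1 + f2) / 2
    k1 = (i0 - i2) * f0
    k2 = (i1 + i2) * fa
    k3 = (i2 - i1) * fb
    k4 = (i1 - i3) * f2
    return [k1 + k2 + k3, k2 - k3 - k4]


def conv1d_direct(x, f):
    """Reference 1-D convolution (no filter flip, as used in DNNs)."""
    return [sum(x[p + r] * f[r] for r in range(len(f)))
            for p in range(len(x) - len(f) + 1)]


def conv1d_winograd(x, f):
    """Slide a 4-input window by 2, so neighboring tiles share 2 inputs."""
    out = []
    for start in range(0, len(x) - 3, 2):
        out.extend(winograd_1x3_tile(x[start:start + 4], f))
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
f = [1, 2, 4]
direct = conv1d_direct(x, f)
tiled = conv1d_winograd(x, f)
```

Each tile produces two outputs from four inputs, so the eight inputs above are covered by three overlapping tiles.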

The reduction in multiplications that Winograd achieves varies based on the filter and 
tile size. A larger tile size results in a larger reduction in multiplications at the cost of higher 
complexity transforms. A particularly attractive filter size is 3x3, which can reduce the number 
of multiplications by 2.25x when computing a tile of 2x2 outputs. Note that Winograd 
requires specialized processing depending on the size of the filter and tile, so Winograd hardware 
typically supports only specific tile and filter sizes.⁵ 


5For instance, NVDLA only supports 3x3 filters [135]. 


Winograd can also be expressed in a matrix linear algebraic formulation. 
In this formulation, the first steps of the computation are transformations of both the filter 
weights ($f_j$) and input activations ($i_j$) by sandwiching those matrices in a chain of matrix 
multiplies by constant matrices, $GfG^T$ and $B^T i B$, respectively. The resulting values can be considered 
as existing in a “Winograd” space, where a convolution can be performed by combining those 
matrices with element-wise multiplication, which is computationally efficient, as follows: 


$$[GfG^T] \odot [B^T i B].$$


Finally, a reverse transformation out of the “Winograd” space is performed by, again, sandwiching 
the result of the element-wise multiplication in a chain of matrix multiplies by constant 
matrices $A^T$ and $A$: 


$$Y = A^T \left[ [GfG^T] \odot [B^T i B] \right] A.$$


Note that since the filter weights are constant across many applications of the tiled convolution, 
the transformation of the filter weights into the “Winograd” space, $GfG^T$, only needs to be 
performed once. 
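
The matrix formulation can be exercised for the case of a 3x3 filter and a 2x2 output tile. The constant matrices below are the ones commonly used in the literature for this case and are an assumption here, since the source's own matrix listing is not reproduced in this text; the function names are likewise hypothetical.

```python
def matmul(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Assumed constant matrices for a 2x2 output tile with a 3x3 filter.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def winograd_2x2_3x3(tile, filt):
    """Y = A^T [ (G f G^T) elementwise-times (B^T i B) ] A for a 4x4 input tile."""
    u = matmul(matmul(G, filt), transpose(G))        # filter in Winograd space
    v = matmul(matmul(BT, tile), transpose(BT))      # input in Winograd space
    m = [[u[r][c] * v[r][c] for c in range(4)] for r in range(4)]  # 16 multiplies
    return matmul(matmul(AT, m), transpose(AT))      # back to a 2x2 output tile


def conv2d_direct(tile, filt):
    """Reference: 2x2 outputs of a 3x3 filter over a 4x4 tile (36 multiplies)."""
    return [[sum(tile[p + r][q + s] * filt[r][s] for r in range(3) for s in range(3))
             for q in range(2)] for p in range(2)]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
filt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
y_winograd = winograd_2x2_3x3(tile, filt)
y_direct = conv2d_direct(tile, filt)
```

The element-wise product uses 16 multiplications where the direct computation uses 36, which matches the 2.25x reduction quoted for this case.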


4.3.4 FAST FOURIER TRANSFORM 


The Fast Fourier Transform (FFT) [19, 136] follows a similar pattern to the Winograd 
transform to convert a convolution into a new space where convolution is more computationally 
efficient. This well-known approach, shown in Figure 4.7, reduces the number of multiplications 
for each input channel from $O(RSPQ)$ to $O(PQ \log_2 PQ)$, where the output size is P x Q and 
the filter size is R x S.⁶ To perform the convolution, we take the FFT of the filter and input 
feature map, and then perform the element-wise multiplication in the frequency domain; we 
then apply an inverse FFT to the resulting product to recover the output feature map in the 
spatial domain. However, there are several drawbacks to using FFT: (1) the benefits of FFTs 
decrease as the filter size decreases;⁷ (2) the size of the FFT is dictated by the output feature map size, which 
is often much larger than the filter; and (3) the coefficients in the frequency domain are complex. 
As a result, while FFT reduces computation, it requires larger storage capacity and bandwidth. 
Finally, a popular approach for reducing complexity is to make the weights sparse, which will 
be discussed in Section 8.1.2; using FFTs makes it difficult for this sparsity to be exploited. 


Figure 4.7: FFT to accelerate DNN. 


6Note that convolutions for DNNs do not assume any zero padding, which differs from the traditional forms of 
convolutions associated with FFTs. For more on this, please see [137]. 


Several optimizations can be performed on FFT to make it more effective for DNNs. 
To reduce the number of operations, the FFT of the filter can be precomputed and stored. In 
addition, the FFT of the input feature map can be computed once and used to generate multiple 
channels in the output feature map. Finally, since an image contains only real values, its Fourier 
Transform is symmetric and this can be exploited to reduce storage and computation cost. 
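
A 1-D sketch of the frequency-domain approach is given below, using a small recursive radix-2 FFT written in plain Python. The function names, the zero padding to a power of two, and the filter reversal (needed because DNN convolution does not flip the filter) are illustrative choices for this sketch, not a description of any library's implementation.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def conv1d_fft(x, f):
    """DNN-style 1-D convolution of input x with filter f via the FFT."""
    n = 1
    while n < len(x) + len(f) - 1:     # pad so circular wrap-around cannot occur
        n *= 2
    fx = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    # Reverse the filter so that true convolution yields the DNN-style result.
    ff = fft([complex(v) for v in reversed(f)] + [0j] * (n - len(f)))
    product = [a * b for a, b in zip(fx, ff)]          # element-wise multiply
    full = [v.real / n for v in fft(product, inverse=True)]
    return full[len(f) - 1:len(x)]                     # keep the valid outputs


x = [1, 2, 3, 4, 5, 6, 7, 8]
f = [1, 2, 4]
fft_out = conv1d_fft(x, f)
```

In this sketch the transformed filter `ff` depends only on the weights, so it could be computed once and stored, as suggested above.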


4.3.5 SELECTING A TRANSFORM 


In practice, different algorithms might be used for different layer shapes and sizes (e.g., FFT for
filters greater than 5×5, and Winograd for filters 3×3 and below). Existing platform libraries,
such as MKL and cuDNN, dynamically choose the appropriate algorithm for a given shape and
size [138, 139].
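
A minimal sketch of such a shape-based choice follows. It does not describe how MKL or cuDNN
select algorithms; the function names are hypothetical, the thresholds are simply the example
values quoted above, and the fallback to a matrix-multiply-based implementation for in-between
filter sizes is an assumption made here. The second helper encodes the rule of thumb from the
FFT discussion (RS > log₂ PQ), which ignores constant factors.

```python
import math


def choose_conv_algorithm(r, s):
    """Pick a convolution algorithm from the filter size R x S (illustrative thresholds)."""
    if r > 5 and s > 5:
        return "fft"
    if r <= 3 and s <= 3:
        return "winograd"
    return "gemm"  # assumed fallback: Toeplitz transform plus matrix multiply


def fft_reduces_multiplies(r, s, p, q):
    """Rule of thumb: FFT helps when RS exceeds log2(PQ), ignoring constant terms."""
    return r * s > math.log2(p * q)
```

For a hypothetical layer with a 56 × 56 output, `fft_reduces_multiplies` is false for a 3 × 3
filter (9 versus roughly 11.6) and true for a 7 × 7 filter, which is consistent with reserving
FFT for larger filters.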


⁷ The benefits of FFT depend on the ratio between the filter size R × S and the output size
P × Q. Specifically, after accounting for constant terms, one needs RS > log₂ PQ for there to
be a benefit.


4.4 SUMMARY


In this chapter, we discussed different approaches for achieving efficient processing on tempo- 
ral platforms, such as CPUs and GPUs. The objective was to restructure the computations to 
improve efficiency without any impact on accuracy (i.e., all approaches in this chapter produced 
near bit-wise identical results). These approaches can target reducing memory bandwidth (e.g., 
via tiling); reshaping the computation to be more efficient; or reducing high-cost operations. 
One significant example of reshaping the computation was the Toeplitz transformation, which 
is widely used in CPUs and GPUs. The Toeplitz transformation converts a convolution into 
a matrix multiply by replicating values, which allows application of routines from any of the 
highly optimized matrix multiply libraries. Finally, a variety of other transformations are aimed 
at reducing the number of high-cost operations (i.e., multiplies) through algebraic re-association 
including the Strassen, Winograd, and FFT transforms.


CHAPTER 5


Designing DNN Accelerators 


In Chapter 4, we discussed how DNN processing can undergo transforms to leverage optimized 
libraries or reduce the number of operations, specifically multiplications, in order to achieve 
higher performance (i.e., higher throughput and/or lower latency) on off-the-shelf general- 
purpose processors such as CPUs and GPUs. In this chapter, we will focus on optimizing the 
processing of DNNs directly by designing specialized hardware. 

A major motivation for designing specialized hardware instead of just trying to improve 
general-purpose processors is described by John Hennessy and Dave Patterson in their 2018 
Turing Award Lecture [11]. In that lecture, they argue that with the end of Moore’s law [140] 
there is a need to employ domain-specific hardware/software co-design (e.g., domain-specific 
languages such as TensorFlow) in computing systems to continue to improve performance and
energy efficiency for important computational domains. 

An architectural recipe for designing such specialized systems is outlined in Leiserson et
al. [141]. That article describes how improving performance and/or energy efficiency can be
achieved by: (1) identifying opportunities for significant parallelism and data locality in
workloads in the domain of interest/importance; (2) designing a hardware organization that
exploits that parallelism and data locality; and (3) streamlining that hardware to maximize
efficiency, possibly through hardware-software as well as hardware-algorithm co-design. Here,
we distinguish between these two forms of co-design: hardware-software co-design refers to the
development of new software and languages which improves ease of use; furthermore, the compiler
can map such workloads better to domain-specific hardware to enable improvements in performance
and energy efficiency. Hardware-algorithm co-design refers to modifying the algorithm, and thus
its workloads, in conjunction with the hardware for improvements in performance and energy
efficiency that could not be achieved by each approach individually. Note that in this book we
primarily focus on hardware-algorithm co-design.¹

The domain of DNN acceleration fits perfectly into the paradigm just described. First, the
computational domain is important and admits of considerable parallelism and data locality (i.e.,
reuse opportunities). Thus, the goal is to design specialized DNN hardware to further improve
key metrics, such as performance and energy efficiency, over general-purpose processors across
a wide domain of DNN computations. So, second, we will explore (in this chapter) hardware
organizations that can exploit both the parallelism and data locality in DNN computations to
achieve those goals. Third, in Chapters 7, 8, and 9, we will explore how to co-design the hardware
and DNN algorithms to further improve efficiency.


¹ Sometimes in the literature hardware-software co-design is used to encompass both
hardware-software and hardware-algorithm co-design. However, in this book, we want to make
these forms of co-design distinct.

When considering the hardware organizations for DNN acceleration, the design space 
for specialized DNN hardware is quite large. This is due to the fact that there are no constraints 
on the execution order of the MAC operations within a DNN layer. As a result, the hardware 
designer has significant flexibility to choose the execution order of operations and optimize the 
hardware for the target metrics, under some given resource constraints (e.g., number of compute 
datapaths and amount of storage capacity). 

In order to approach the hardware design in a large design space, we will discuss several 
key design decisions and how they affect performance and energy efficiency, and then show how 
these design decisions can be formally described using a loop nest, which is commonly used to
describe the processing of DNNs. We will then discuss several design patterns that are commonly
used in real-world architectures for DNN acceleration using loop nests. In the final sections 
of this chapter, we will discuss design patterns for efficient and flexible data storage and data 
movement via a flexible network on chip (NoC). 


5.1 EVALUATION METRICS AND DESIGN OBJECTIVES 


As discussed in Chapter 3, energy consumption and performance are two of the principal driving 
metrics for the design of specialized hardware. In modern compute systems, energy consump- 
tion is often dominated by data movement, especially memory access [121]. This is because 
accessing data from memory, especially off-chip storage such as DRAM, can consume orders 
of magnitude more energy than the actual computation of operands (e.g., MACs). Even for 
the on-chip memories, accessing data from the larger memory (e.g., caches) is also much more 
energy consuming than accessing data from the smaller memory (e.g., registers). Thus, in order 
to reduce energy consumption, one objective is to design hardware that reduces data movement
by:


• reducing the number of times values are moved from sources that have a high energy cost,
such as DRAM or large on-chip buffers; and


• reducing the cost of moving each value, for example by reducing the data’s bit-width,
which we will discuss in Chapter 7.


Performance in terms of throughput and latency, on the other hand, is largely dictated by
the number of processing elements (PEs) and more specifically the number of multipliers that
can operate in parallel.² Therefore, another objective is to design hardware that:


• allocates work to as many PEs as possible so that they can operate in parallel; and


• minimizes the number of idle cycles per PE by ensuring that there is sufficient memory
bandwidth to deliver the data that needs to be processed, the data is delivered before it is
needed, and workload imbalance among the parallel PEs is minimized.


² The throughput of each PE also affects performance, but we consider that as an orthogonal
micro-architectural design decision.


Note that energy consumption and hardware performance can also be improved by designing
hardware that only performs the necessary operations; for instance, when the data is sparse
(i.e., has many zeros), the hardware should only perform MACs on the non-zero data, and it can
also use representations that exploit zeros to save space and data movement. Exploiting sparsity
will be discussed in Chapter 8. Finally, energy consumption and performance might also be
improved with new technologies, which are described in Chapter 10.
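
The performance objectives of this section (spreading work across PEs and avoiding idle cycles)
can be made concrete with a small back-of-the-envelope model. The sketch below is an
illustration built on assumptions made here, not a model taken from this book: the helper names
are hypothetical, each PE is assumed to complete one MAC per cycle with no stalls for data, and
the layer is assumed to finish when the most heavily loaded PE finishes, so any imbalance shows
up as idle cycles on the other PEs.

```python
def execution_cycles(macs_per_pe):
    """Cycles needed when every PE performs one MAC per cycle and all start together."""
    return max(macs_per_pe)


def pe_utilization(macs_per_pe):
    """Fraction of PE-cycles spent on useful MACs (1.0 means no idle cycles)."""
    useful = sum(macs_per_pe)
    available = len(macs_per_pe) * execution_cycles(macs_per_pe)
    return useful / available
```

Under these assumptions, 64 MACs split evenly over four PEs take 16 cycles at full utilization,
whereas an uneven split of 32, 16, 8, and 8 MACs takes 32 cycles at 50% utilization.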


5.2 KEY PROPERTIES OF DNN TO LEVERAGE 


There are several properties of DNNs that can be leveraged by the hardware to optimize for the
design objectives discussed in Section 5.1, thus improving the hardware performance and energy 
efficiency. First, as previously mentioned, the key computation of DNNs involves many MACs 
that have no restriction on their execution order within a layer. Therefore, although DNNs re- 
quire a significant number of MACs per layer, the hardware can still achieve higher throughput 
and lower latency by exploiting high compute parallelism. However, the challenge of reducing 
the energy consumption of moving data to the parallel PEs remains. 

Thankfully, the data movement cost can be addressed by another important property of 
DNNs, which is that the same piece of data is often used for multiple MAC operations. This 
property results in three forms of data reuse, as shown in Figure 5.1. 


• Input feature map reuse: Different filters (from dimension M) are applied to the same input
feature map in order to generate multiple output feature maps; therefore, each input
activation is reused M times.


• Filter reuse: When processing a batch (of size N) of input feature maps, the same filter is
applied to all inputs in the batch; therefore, each filter weight is reused N times.


• Convolutional reuse: For convolutional layers, the size of a filter (R × S) is smaller than the
size of an input feature map (H × W), and the filter slides across different positions (often
overlapping with each other) in the input feature map to generate an output feature map.
As a result, each weight and input activation are further reused P × Q and R × S times,³
respectively, to generate different output activations.
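
The three forms of reuse listed above multiply together. The following sketch combines them for
a single CONV layer; the function names are hypothetical, and, like the text, it assumes a
stride of 1 and ignores the reduced reuse of input activations at the edges of the feature map.

```python
def reuse_per_weight(n, p, q):
    """Filter reuse (N) times convolutional reuse (P x Q)."""
    return n * p * q


def reuse_per_input_activation(m, r, s):
    """Input feature map reuse (M) times convolutional reuse (R x S)."""
    return m * r * s


def total_macs(n, m, c, p, q, r, s):
    """Every output activation (N*M*P*Q of them) needs C*R*S MACs."""
    return n * m * c * p * q * r * s


def num_weights(m, c, r, s):
    """Number of filter weights in the layer."""
    return m * c * r * s
```

As a consistency check, the number of weights multiplied by the reuse per weight equals the
total number of MACs in the layer.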


³ For simplicity, we ignore the halo effect at the edges of input feature maps that results in
less reuse for input activations and assume a stride of 1.


Figure 5.1: Data reuse opportunities in CONV or FC layers of DNNs.


Data reuse can translate to reduced energy consumption for data movement through reading
data once from a large but expensive (in terms of energy cost) memory, and either:


• storing the data in a relatively small, but cheap memory, and (temporally) reusing that data
multiple times at that cheap memory; or


• sending the same data to multiple PEs and (spatially) using the data at those distinct PEs.


Both methods save accesses to the expensive memory, and therefore reduce the overall en- 
ergy cost. While the maximum amount of data reuse for a DNN layer is defined by its shape and 
size (e.g., number of channels, filter size, etc.), the corresponding energy cost savings is deter- 
mined by the amount of data reuse that is actually harnessed by the specialized hardware through 
these methods. We will discuss how to apply these methods in the hardware in Sections 5.4 and 
5.5. 

In addition to exploiting data reuse for input activations and filter weights, specialized 
hardware can also reduce the energy cost of data movement by properly orchestrating the move- 
ment of partial sums, which are the intermediate results from the multiplications. If partial sums 
can be temporally accumulated in a small buffer (e.g., registers) the energy cost is less than read- 
ing and updating a value from a larger buffer. Alternatively, accumulating the partial sums in 
the same cycle through an adder tree (i.e., as a spatial sum) can reduce the required storage
capacity [135, 142]. In the best case, all C × R × S partial sums for each output activation can be
accumulated in one cycle, and therefore partial sums never need to be stored in memory.
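
The two accumulation styles can be contrasted with a small functional sketch. It is a software
analogy under assumptions made here, with hypothetical function names: `temporal_sum` models a
single accumulator updated once per step, while `spatial_sum` models a tree of two-input adders
and reports the tree depth. It does not model hardware details such as the number
representation used inside the adders.

```python
def temporal_sum(products):
    """Accumulate one product per step into a single register; returns (sum, steps)."""
    accumulator = 0
    steps = 0
    for value in products:
        accumulator += value  # the partial sum is read and updated every step
        steps += 1
    return accumulator, steps


def spatial_sum(products):
    """Reduce a non-empty list through an adder tree; returns (sum, tree depth)."""
    level = list(products)
    depth = 0
    while len(level) > 1:
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            next_level.append(level[-1])  # odd element passes through to the next level
        level = next_level
        depth += 1
    return level[0], depth
```

Both functions return the same sum; the difference is that the temporal version holds a partial
sum across all steps, whereas the spatial version produces the output activation directly from
the C × R × S products.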

Employing a spatial sum can also reduce the energy and latency compared to the same 
capability implemented with individual two-input adders [143]; specifically, rather than per- 
forming a carry propagation after each addition, using a redundant binary representation for the 
adder tree allows the carry propagation to be deferred to a single adder at the bottom of the tree. 
The magnitude of the benefit of the spatial sum over temporal sum depends on the reduction 
factor [144]. 

Although the cost of data movement for each data type can be minimized individually,
it is not possible to do so for all three data types (i.e., input activations, weights, and partial
sums) at the same time. This is due to the fact that different MACs can only reuse data for at
most one data type at a time. For example, MACs that reuse the same weight require different
input activations and the generated partial sums cannot be further accumulated together. Thus,
the methods mentioned above can only be applied to reduce the energy cost of one data type at
a time and will increase the cost of data movement for the other two data types. Therefore, in
order to minimize the overall energy consumption, it is important to balance the cost of data
movement for all three data types through an optimization process instead of just minimizing
for any specific one of them.
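
The claim that two different MACs share at most one data type can be checked exhaustively on a
small case. The sketch below uses a 1-D convolution as a stand-in for a full DNN layer, which is
a simplifying assumption made here; the function names are hypothetical. Each MAC is identified
by the weight, input activation, and partial sum (output) it touches.

```python
from itertools import combinations


def macs_1d(num_outputs, num_weights):
    """List every MAC as (weight index, input index, output index)."""
    return [(s, q + s, q) for q in range(num_outputs) for s in range(num_weights)]


def max_shared_operands(macs):
    """Largest number of operands that any two distinct MACs have in common."""
    return max(sum(a == b for a, b in zip(first, second))
               for first, second in combinations(macs, 2))
```

For a 1-D convolution with 4 outputs and 4 weights, `max_shared_operands` returns 1: any two
MACs that use the same weight differ in both their input activation and their output.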

The optimization process depends on the specific shape and size of the DNN layer in 
addition to the hardware constraints. However, while the hardware constraints stay fixed, the 
shape and size of a DNN layer can vary dramatically across different DNNs and also across
different layers within a DNN. Therefore, the optimization of data movement has to be performed
for each DNN layer separately,⁴ and the hardware needs to be able to support different
configurations of data movement based on the results of the optimization. This flexibility requirement
raises several considerations that are critical to the design of specialized DNN hardware, which 
we will discuss in the next section. 


5.3 DNN HARDWARE DESIGN CONSIDERATIONS 


The challenges of designing specialized DNN hardware involve designing a flexible architecture 
and then finding the best ways to configure the architecture to get optimal hardware perfor- 
mance and energy efficiency for different DNN layers. These two aspects are tightly correlated, 
since the chance of finding the optimal configuration depends on the flexibility of the hardware, 
while higher flexibility often implies a loss of efficiency since additional hardware is required. 
Therefore, it often takes an iterative process to distill down to the best design. This is in contrast
to the approach taken in Chapter 4, where the hardware architecture is already fixed and the 
focus is on adapting the computation to fit in the compute paradigm of the hardware. 

Given a particular DNN model, it is necessary to be able to configure the hardware to 
minimize the overall energy consumption while maintaining high performance. This process 
involves finding an optimal mapping, where a mapping defines: (1) the execution order of the 
MAC operations, both temporally (i.e., serial order on the same PE) and spatially (i.e., across 
many parallel PEs); and (2) how to tile and move data across the different levels of the memory 
hierarchy to carry out the computation in accord with that execution order. For a given DNN
layer, there often exist a large number of possible mappings. As a result, it is crucial to be
able to find the best mappings for the desired metrics. In Section 5.4, we will show examples
of different mappings and discuss their impact on the performance of the hardware. Chapter 6 
extends that discussion with more details of the mapping optimization process. 


⁴ Most of the focus in this chapter and the next is on per-layer processing. Section 5.7.6 has a
brief discussion of cross-layer optimization.


Given the sheer number of possible spatio-temporal orderings of MAC operations for a
DNN layer, it is generally very unlikely for a hardware architecture to support the execution of
all these orderings. Therefore, practical hardware will support a more restricted set of orderings,
which consequently reduces the number of legal mappings. As a result, a critical design
consideration for DNN hardware is to select a subset of mappings to support. In other words,
the hardware architecture has to set up rules that narrow down the range of considered mappings
in the optimization. An important set of rules that determine the subset of supported mappings
is called a dataflow. Dataflow is one of the most important attributes that define the architecture
of DNN hardware, as the subset of supported mappings directly impacts the quality of the
optimization. In Section 5.6, we will formally define dataflow and mapping using a loop
nest-based representation.

An additional design consideration is that the shape and size of a DNN layer, as described 
in Chapter 2, can vary dramatically across different DNNs and also across different layers within 
a DNN. Furthermore, the number of layers that are found in emerging DNNs continues to 
grow. Therefore, given this rapidly growing and changing field, it is increasingly important to 
design DNN hardware that is sufficiently flexible and scalable to address these varying needs. 
Specifically, the hardware should avoid making assumptions about the layer shape and size; 
for instance, since DNNs have tended to grow in size, one cannot assume the entire model can 
always be stored on-chip. The degree of flexibility needed in the hardware to efficiently support a 
wide variety of DNNs has become one of the main challenges in the design of DNN accelerators. 

To summarize, the design and use of specialized DNN hardware involves multiple steps. 


• At design time, an architecture is specified with a set of attributes. These attributes
include: (1) the dataflow or dataflows supported; (2) the number of PEs and the number
of multipliers and adders per PE;⁵ (3) the memory hierarchy, including the number of
storage levels and the storage capacity at each level; and (4) the allowed patterns of data
delivery for the NoC within the memory hierarchy and between the memory and PEs.
Note that these attributes set certain limitations on the legal mappings of an architecture.
For example, the amount of storage capacity at each level of the memory hierarchy impacts
how much data reuse can be exploited in the hardware, which limits the set of supported
mappings. We will discuss these limitations in Section 5.5.


• At mapping time, given a DNN model, a mapping that optimizes the desired operational
metrics is selected from among all the mappings supported by the accelerator. In
Chapter 6, we will discuss the process of finding optimal mappings for a DNN accelerator.


• At configuration time, a configuration derived from the selected mapping is loaded into
the accelerator.


• At runtime, the desired DNN input data is loaded and processed according to the loaded
configuration. However, depending on the capabilities of the accelerator (e.g., how many
layers it can process per configuration), it might need to iterate between the configuration
step and the run step multiple times.


⁵ Without loss of generality, we will assume that each PE contains a single multiplier and adder
in the rest of the chapter unless explicitly specified.


In the next section, we will explore how to exploit data reuse. 


5.4 ARCHITECTURAL TECHNIQUES FOR EXPLOITING 
DATA REUSE 


As hinted in Section 5.2, there are two methods for exploiting data reuse to reduce the energy 
cost of data movement. These architectural techniques are called temporal reuse and spatial reuse.
In this section, we will formally define them and describe how to apply these techniques in the 
hardware. 


5.4.1 TEMPORAL REUSE 


Temporal reuse occurs when the same data value is used more than once by the same consumer 
(e.g., a PE). It can be exploited by adding an intermediate memory level to the memory hierarchy
of the hardware, where the intermediate memory level has a smaller storage capacity than the 
level that acts as the original source of the data; an example is shown in Figure 5.2d. Since smaller 
memories consume less energy to access than larger memories, the data value is transferred once 
from the source level (i.e., larger memory) to the intermediate level (i.e., smaller memory), and 
used multiple times at the intermediate level, which reduces the overall energy cost.⁶
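
The energy argument can be sketched with a simple cost model. Everything in the sketch is an
assumption made for illustration: the function names are hypothetical, the energy values are
relative units rather than measured numbers, and writing a value into the intermediate level is
assumed to cost the same as reading it from that level.

```python
def energy_without_intermediate(num_uses, e_source):
    """Every use reads the value directly from the source level."""
    return num_uses * e_source


def energy_with_intermediate(num_uses, e_source, e_intermediate):
    """One source read, one write into the intermediate level, then local reads."""
    return e_source + e_intermediate + num_uses * e_intermediate


def reuse_is_worthwhile(num_uses, e_source, e_intermediate):
    """True when staging the value in the intermediate level saves energy."""
    return (energy_with_intermediate(num_uses, e_source, e_intermediate)
            < energy_without_intermediate(num_uses, e_source))
```

With a value used 4 times and a hypothetical cost ratio of 6 to 1 between the two levels, the
total drops from 24 to 11 units. With a single use, or with a small cost ratio, the extra
transfer is not recovered and the intermediate level does not help.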

⁶ The assumption is that the ratio of energy per access between the two levels is large enough so
that it is worthwhile to move data from the large memory to the smaller memory for reuse.


Figure 5.2: To run the example 1-D convolution in (a), two possible operation orderings are
shown in (b) and (c). (b) has a reuse distance of 4 for weights, while (c) has a reuse distance of
1. Given the memory hierarchy shown in (d), which has only 1 slot in the intermediate memory
allocated to weights, the ordering in (c) can fully exploit temporal reuse while the ordering in (b)
cannot, since it has a reuse distance larger than the storage capacity of the intermediate memory.
Note that the numbers shown here are the indices of the values in each of the data vectors.


Since the intermediate memory level has a smaller storage capacity, it cannot fit all data
from the source level at the same time. As a result, data in the intermediate level may be replaced
by new data from the source level and lose the chance to further exploit temporal reuse. Whether
data in the intermediate level will be replaced or not depends on the reuse distance. For exploiting
temporal reuse, the reuse distance is defined as the number of data accesses required by the
consumer in between the accesses to the same data value, which is a function of the ordering
of operations. Figure 5.2 shows an example to illustrate this phenomenon. For the example
1-D convolution workload shown in Figure 5.2a, two different operation orderings are shown
in Figures 5.2b and 5.2c: the former has a temporal reuse distance for weights of 4, while the
latter has a temporal reuse distance for weights of 1. Given the memory hierarchy shown in
Figure 5.2d, in which only the memory space allocated to weights is shown and the intermediate
memory level (L1) has only 1 slot allocated to weights, the ordering in Figure 5.2b will have to
keep swapping the weights in L1, while the ordering in Figure 5.2c can read each weight once
from the source into L1 and reuse it 4 times. Therefore, if the reuse distance for a data type is
smaller than or equal to the storage capacity of the intermediate memory level, temporal reuse
can be exploited for all values of that data type. However, if the reuse distance is larger, then a
part or all of the data values that were stored in the intermediate level would be replaced before
the reuse opportunities are fully exploited. In other words, the storage capacity of the
intermediate memory level limits the maximum reuse distance where temporal reuse can be
exploited.
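
The relationship between reuse distance and buffer capacity can be reproduced with a short
simulation. The sketch below is intended to mirror the two orderings described for Figure 5.2
(4 weights, 4 outputs) without reproducing the figure itself; the function names are
hypothetical, and the least-recently-used replacement policy for the intermediate level is an
assumption made here.

```python
from collections import OrderedDict

NUM_OUTPUTS = 4
NUM_WEIGHTS = 4


def weight_trace_output_major():
    """Finish one output at a time: weights cycle 0, 1, 2, 3, 0, 1, 2, 3, ..."""
    return [s for q in range(NUM_OUTPUTS) for s in range(NUM_WEIGHTS)]


def weight_trace_weight_major():
    """Use one weight for all outputs before moving on: 0, 0, 0, 0, 1, 1, ..."""
    return [s for s in range(NUM_WEIGHTS) for q in range(NUM_OUTPUTS)]


def max_reuse_distance(trace):
    """Largest gap, in accesses, between consecutive uses of the same value."""
    last_seen = {}
    worst = 0
    for position, value in enumerate(trace):
        if value in last_seen:
            worst = max(worst, position - last_seen[value])
        last_seen[value] = position
    return worst


def source_reads(trace, capacity):
    """Count reads from the source level, given an LRU buffer with `capacity` slots."""
    buffer = OrderedDict()
    reads = 0
    for value in trace:
        if value in buffer:
            buffer.move_to_end(value)       # hit: the value is reused locally
        else:
            reads += 1                      # miss: fetch from the source level
            buffer[value] = True
            if len(buffer) > capacity:
                buffer.popitem(last=False)  # evict the least recently used value
    return reads
```

With a single slot, the output-major ordering (reuse distance 4) fetches a weight from the
source on all 16 accesses, while the weight-major ordering (reuse distance 1) needs only 4
fetches. Giving the buffer 4 slots lets the output-major ordering also reach 4 fetches, at the
cost of a larger memory.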

Reducing the reuse distance of one data type often comes at the cost of increasing the
reuse distance of other data types, which should be taken into account at the same time.
Although it is possible to increase the storage capacity of the intermediate memory level to exploit
temporal reuse on larger reuse distances, it has the counter effect that the energy cost per
memory access also goes up, which increases the average energy cost of data movement for all reuse
distances. In order to keep the memory small, an alternative solution is to reduce the reuse
distance by changing the processing order of MACs. Various techniques for reducing reuse distance
will be discussed in Section 5.5.

Temporal reuse can be exploited at multiple levels of the memory hierarchy. By treating the 
intermediate memory level as the new consumer of data from the source level, additional levels of 
memory can be added to further exploit temporal reuse. However, adding more memory levels 
requires more area and results in a reduced ratio of energy cost between different levels, which
diminishes the effectiveness of exploiting temporal reuse. 


5.4.2 SPATIAL REUSE 


Spatial reuse occurs when the same data value is used by more than one consumer (e.g., a group 
of PEs) at different spatial locations of the hardware. It can be exploited by reading the data once 
from the source memory level and multicasting it to all of the consumers. Exploiting spatial reuse 
has the benefits of (1) reducing the number of accesses to the source memory level, which reduces 
the overall energy cost, and (2) reducing the bandwidth required from the source memory level, 
which helps to keep the PEs busy and therefore increases performance. 

If a group of consumers of a data value have no storage capacity, spatial reuse can only be
exploited by the subset of consumers that can process the data in the same cycle as the multicast 
of the data; the other subset of consumers that requires the same data value, but cannot process 
it in the same cycle, needs to be sent the data again by the source memory level. In contrast, 
if each consumer in the group has some storage capacity, then a certain time span (dictated by 
the storage capacity) in which the same data value is processed by multiple consumers can be 
tolerated when exploiting spatial reuse with multicast. In other words, whether a consumer in 
the group that uses the same data value can exploit spatial reuse or not also depends on the reuse 
distance. For exploiting spatial reuse, the reuse distance is defined as the maximum number of 
data accesses in between any pair of consumers that access the same data value, which is again a 
function of the ordering of operations. 

An example is shown in Figure 5.3. For the same 1-D convolution workload in Figure 5.2, 
we run it on a new architecture with the memory hierarchy shown in Figure 5.3a, which has four 
consumers, C0 to C3. Each consumer also has 1 slot of local storage for weights. Figures 5.3b, 
5.3c, and 5.3d show three different orderings of operations (with only the weights being shown 
for simplicity). In addition to the time axis, the orderings also have the space axis to indicate the 
ordering at each parallel consumer. The spatial reuse distances for weights are 0, 1, and 3 for the
orderings in Figures 5.3b, 5.3c, and 5.3d, respectively. Since the reuse distances of the orderings
in Figures 5.3b and 5.3c are smaller than or equal to the storage capacity at the consumer level,
only a single multicast from the source memory to all consumers is needed for each weight.
However, for the ordering in Figure 5.3d, multiple reads from the source memory are needed for
each weight, since its reuse distance is larger than the storage capacity at the consumers.
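
The spatial reuse distance can be computed from a schedule that records which weight each
consumer uses at each time step. The schedules built below are hypothetical constructions
chosen to have the same reuse distances as those quoted above (0, 1, and 3); they are not
reproductions of the orderings in Figure 5.3. The sketch also assumes that each consumer
performs one access per time step and uses each weight only once.

```python
def skewed_schedule(delays, num_weights=4):
    """Consumer c starts `delays[c]` steps late, then uses weights 0..num_weights-1."""
    longest = max(delays)
    return [[None] * d + list(range(num_weights)) + [None] * (longest - d)
            for d in delays]


def spatial_reuse_distance(schedule):
    """Largest time gap between any two consumers that use the same weight."""
    first_use, last_use = {}, {}
    for per_consumer in schedule:
        for step, weight in enumerate(per_consumer):
            if weight is None:
                continue  # this consumer is idle at this step
            first_use[weight] = min(step, first_use.get(weight, step))
            last_use[weight] = max(step, last_use.get(weight, step))
    return max(last_use[w] - first_use[w] for w in first_use)


def single_multicast_suffices(schedule, slots_per_consumer):
    """One multicast per weight is enough when the distance fits in local storage."""
    return spatial_reuse_distance(schedule) <= slots_per_consumer
```

With one slot of storage per consumer, the lockstep schedule `skewed_schedule([0, 0, 0, 0])`
and the pairwise-skewed schedule `skewed_schedule([0, 0, 1, 1])` each need only a single
multicast per weight, while the fully skewed schedule `skewed_schedule([0, 1, 2, 3])` does not.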

Figure 5.3: To run the same 1-D convolution, as shown in Figure 5.2a, on the memory hierarchy
shown in (a), three possible operation orderings are shown in (b), (c), and (d), which have a
spatial reuse distance of 0, 1, and 3, respectively. The distance between the red boxes in the
figures shows the reuse distance. Since each consumer has 1 slot of local storage for storing
weights, the orderings in (b) and (c) can fully exploit spatial reuse of weights since their reuse
distances are smaller than or equal to the storage capacity at the consumers, while the ordering
in (d) has to read each weight multiple times from the source memory level.


In addition to the reuse distance, since spatial reuse involves routing data to multiple
destinations in the hardware, the NoC that distributes data also plays an important role in
achieving spatial reuse. Specifically, the NoC has to support the data distribution patterns as
derived from the ordering of operations. For example, as shown in Figure 5.4, if a data value is
used by all consumers in the same cycle (e.g., Figure 5.3b), but the memory level L1 is banked
in a way that each bank only connects to a subset of consumers, then spatial reuse needs to be
first exploited from a higher memory level L2, where data is multicast from L2 to all banks in
L1 with each bank serving as a single consumer, and then multicast again from the L1 banks to
all of the consumers. Exploiting spatial reuse at higher levels of the memory hierarchy creates
more duplicated data in the hardware, which is not desirable. However, supporting multicast
to all consumers in a single level at a large scale can also be expensive. Therefore, it is a design
trade-off to determine where and how to exploit spatial reuse in the hardware. We will discuss
the design of NoCs for DNN accelerators in Section 5.9.


Figure 5.4: The physical connectivity of the NoC between the L1 memory and the consumers
limits the multicast from any bank in L1 to all consumers. Therefore, data needs to be first
multicast from the L2 memory to all banks in L1, and then each L1 bank further multicasts the
data to the associated consumers. In this case, data is being duplicated twice in L1.


5.5 TECHNIQUES TO REDUCE REUSE DISTANCE


As discussed in Section 5.4, the order of operations determines the reuse distance, which im- 
pacts the effectiveness of exploiting either temporal reuse or spatial reuse. In this section, we 
will discuss the various methods for manipulating the order of operations to reduce the reuse 
distance. 

First, since it is not feasible to minimize the reuse distance for all data types simultane- 
ously, as described in Section 5.2, the order of operations has to prioritize reducing the reuse 
distance of a certain data type over the others. Figures 5.2b and 5.2c are two examples of opera- 
tion ordering that prioritize reducing the reuse distance of partial sums and weights, respectively, 
for temporal reuse. Each data value of the prioritized data type appears to stay stationary over 
time in the sequence of operations (i.e., the same value gets accessed many times from mem- 
ory consecutively). Later on, when we discuss loop nests in Section 5.6, we will show that the 
stationariness of different data types is controlled by the ordering of the nested loops, which 
determines their priority in reuse distance reduction. 
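
The effect of loop ordering on stationarity can be previewed with a two-level loop nest for the
1-D convolution used in Figure 5.2 (4 weights, 4 outputs). This is a simplified sketch with
hypothetical names, not the formal loop nest representation of Section 5.6: whichever index is
in the outer loop stays fixed while the inner loop runs, so the data type it indexes appears
stationary.

```python
def access_traces(outer_loop, num_outputs=4, num_weights=4):
    """Return (weight trace, partial sum trace) for the chosen outer loop."""
    weights, psums = [], []
    if outer_loop == "outputs":
        for q in range(num_outputs):        # partial sum q stays stationary
            for s in range(num_weights):
                weights.append(s)
                psums.append(q)
    elif outer_loop == "weights":
        for s in range(num_weights):        # weight s stays stationary
            for q in range(num_outputs):
                weights.append(s)
                psums.append(q)
    else:
        raise ValueError("outer_loop must be 'outputs' or 'weights'")
    return weights, psums


def max_reuse_distance(trace):
    """Largest gap, in accesses, between consecutive uses of the same value."""
    last_seen = {}
    worst = 0
    for position, value in enumerate(trace):
        if value in last_seen:
            worst = max(worst, position - last_seen[value])
        last_seen[value] = position
    return worst
```

Swapping the two loops swaps the reuse distances: with outputs in the outer loop, partial sums
have a reuse distance of 1 and weights 4; with weights in the outer loop, it is the reverse.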

Data tiling, also referred to as blocking, is another technique that is commonly used to 
reduce the reuse distance, such as tiling for CPUs or GPUs, as described in Section 4.2. Data 
is partitioned into smaller tiles through tiling, and the processing only focuses on a tile at a
time for each data type. There are many ways to tile the data for processing. In addition to 
the size of each tile, deciding along which dimensions the data (which is 4-D for each data 
type in DNNs) is being tiled is another design decision. The goal is to tile the data so that the 
reuse distance becomes smaller. For example, in the operation ordering in Figure 5.2c, where 
weights are prioritized (i.e., each weight stays stationary across multiple cycles) the temporal 
reuse distance of partial sums is large, and is the same as the length of the entire output activation 
vector, which is 4 in this case. Tiling can be applied to reduce the reuse distance of partial sums, 
as shown in Figure 5.5, by only working on half of the output activations at a time, thus cutting 
Figure 5.5: A temporally tiled version of the operation ordering in Figure 5.2c. 


the reuse distance of partial sums in half, down to 2. However, this also increases the 
average reuse distance of weights, although weights remain the most prioritized data type. 
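
The trade-off described above can be checked with a short script. This is only an illustrative sketch: the helper names (`ws_untiled`, `ws_tiled`, `reuse_gaps`) are hypothetical, the shape S = 4, Q = 4 is assumed, and reuse distance is approximated here as the number of operation steps between consecutive accesses to the same value.

```python
def ws_untiled(S, Q):
    """Weight-stationary order: all Q outputs are visited for each weight."""
    return [(s, q) for s in range(S) for q in range(Q)]


def ws_tiled(S, Q1, Q0):
    """Same operations, but only a tile of Q0 outputs is worked on at a time."""
    return [(s, q1 * Q0 + q0)
            for q1 in range(Q1)
            for s in range(S)
            for q0 in range(Q0)]


def reuse_gaps(order, field):
    """Steps between consecutive uses of a value (field 0: weight, 1: partial sum)."""
    last_seen, gaps = {}, []
    for step, op in enumerate(order):
        key = op[field]
        if key in last_seen:
            gaps.append(step - last_seen[key])
        last_seen[key] = step
    return gaps


def mean(values):
    return sum(values) / len(values)


untiled = ws_untiled(S=4, Q=4)
tiled = ws_tiled(S=4, Q1=2, Q0=2)

psum_untiled = max(reuse_gaps(untiled, 1))   # partial-sum gap without tiling
psum_tiled = max(reuse_gaps(tiled, 1))       # partial-sum gap with tiles of 2
weight_untiled = mean(reuse_gaps(untiled, 0))
weight_tiled = mean(reuse_gaps(tiled, 0))
print(psum_untiled, psum_tiled, weight_untiled, weight_tiled)  # prints: 4 2 1.0 3.0
```

Under these assumptions the partial-sum gap drops from 4 to 2, while the average weight gap grows from 1 to 3 because each weight must be revisited once per tile.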

Tiling can be employed to exploit either temporal reuse or spatial reuse. Tiling for tempo- 
ral reuse, or temporal tiling, focuses on reducing the reuse distance of specific data types to make 
it smaller than the storage capacity of a certain memory level in the memory hierarchy. For ex- 
ample, assuming the intermediate memory level L1 in Figure 5.2d has a total storage capacity 
of three data values,⁷ the un-tiled ordering in Figure 5.2c can only exploit temporal reuse of 
weights, since the temporal reuse distance of partial sums is so large that they have to be stored 
and fetched from higher memory levels. However, with the tiled ordering in Figure 5.5, one 
weight value and a tile of two partial sum values can fit into L1 to exploit temporal reuse. Note, 
however, that each weight value written into L1 is then reused only twice before it is replaced, and 
must be read twice from a higher memory level into L1 instead of once as in the un-tiled 
ordering. 

On the other hand, tiling for spatial reuse, or spatial tiling, focuses on (1) reusing the 
same data value by as many consumers as possible, and (2) reducing the reuse distance so that 
one multicast can serve as many consumers as possible given a fixed amount of storage capacity 
at each consumer. For example, Figures 5.6a, 5.6b, and 5.6c show the impact of three differ- 
ent spatial tilings that result in operation orderings with different degrees of spatial reuse. The 
ordering in Figure 5.6a has no spatial reuse of weights as each weight is only used by a single 
consumer. The ordering in Figure 5.6c has the highest degree of spatial reuse as each weight is 
used by all consumers. Spatial tiling thus increases the amount of spatial reuse and reduces the 
number of reads from the source memory. 

The goal of both temporal and spatial tiling is to exactly match the reuse distance of a 
data type to the available storage capacity at a certain memory level to maximally exploit data 
reuse at that level. Overly reducing the reuse distance does not bring additional benefits but only 
increases the reuse distance of other data types. However, exactly matching the reuse distance 


⁷We will ignore the differences in bit-width between different data types for now. In general, increasing the bit-width 
of a data type would increase its impact on energy consumption, and thus the dataflow would be more sensitive to that data 
type. 
Figure 5.6: Operation orderings with different degrees of spatial reuse for the memory hierarchy 
shown in Figure 5.3. 


to the storage capacity is not always feasible since the shapes and sizes of data vary across differ- 
ent layers and different DNNs, while the hardware storage capacity remains fixed. This causes 
fragmentation in mapping, in which case the workload cannot be evenly divided to run on the 
hardware and therefore would result in under-utilization of the hardware. 

In addition, for spatial tiling, we would like to fully utilize all PEs to achieve maximum 
performance while exploiting spatial reuse, which is also not always possible since there might 
not be sufficient spatial reuse to exploit. For example, a common way to distribute data across 
PEs to exploit parallelism is shown in Figure 5.7: data from different output channels (M) are 
sent to different PEs vertically, while data from different input channels (C) are sent to different 
PEs horizontally. Therefore, each PE gets a unique weight, and each input activation is reused 
spatially across a column of PEs with weights from different output channels. However, if M is 
smaller than the number of PEs in a column and/or C is smaller than the number of PEs in a 
row, only a portion of the PEs are utilized (e.g., the colored ones in Figure 5.7). To improve the 
utilization in such cases, multiple tiles of data can run at the same time while not getting more 
spatial reuse, as discussed in Section 5.7.4. This technique, however, is only feasible if there is 
sufficient flexibility of data delivery and the NoC provides sufficient bandwidth. 
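
As a rough illustration of the under-utilization described above, the following sketch estimates the fraction of PE-cycles doing useful work for the mapping of Figure 5.7. The function name `pe_utilization` and its parameters are hypothetical, and the model assumes that output channels (M) are spread down a column, input channels (C) are spread along a row, and any excess channels are handled in additional passes over the array.

```python
import math


def pe_utilization(M, C, pes_per_column, pes_per_row):
    """Fraction of PE-cycles that perform useful MACs for an M-by-C mapping."""
    # Number of passes needed when M or C exceeds the physical array size.
    passes = math.ceil(M / pes_per_column) * math.ceil(C / pes_per_row)
    return (M * C) / (passes * pes_per_column * pes_per_row)


print(pe_utilization(M=4, C=2, pes_per_column=8, pes_per_row=4))   # 0.25
print(pe_utilization(M=12, C=4, pes_per_column=8, pes_per_row=4))  # 0.75
```

The second call shows the fragmentation case: twelve output channels do not divide evenly over a column of eight PEs, so the second pass leaves half of the array idle.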

Both data prioritization and tiling can be applied independently at each level of the mem- 
ory hierarchy. Different data types can be prioritized at different levels of the memory hierarchy, 
which helps to balance the energy cost of accessing all types of data. For example, while weights 
are prioritized for access from memory level L1 to the PEs, partial sums can be prioritized for 
access from memory level L2 to L1. Also, each level of the memory hierarchy can perform either 
temporal or spatial tiling, or both at the same time. There are no direct interactions between the 
spatial and temporal tiling decisions; instead, they can be interleaved at different levels of 
the storage hierarchy. For example, in Figure 5.4, temporal tiling and spatial tiling are applied 
at both level L1 and L2 of the storage hierarchy. The consumers at the last level can have their 
Figure 5.7: Spatial tiling can result in under-utilization of the parallelism if there is insufficient 
spatial reuse to exploit. For example, if the number of output channels (M) or input channels 
(C) is smaller than the number of PEs in a column or row, respectively, the specific mapping 
shown here will only have a portion of PEs utilized for processing (e.g., the colored ones). 


local storage for temporal tiling; however, if the consumers at the last level have no local storage, 
they will operate like a vector-based architecture as mentioned in Chapter 4. Note that, unlike 
many vector architectures, the consumers can communicate data with each other, which is often 
used to do spatial accumulation of the partial sums across PEs. 

Finally, in addition to the reuse distance, the bandwidth requirement should also be taken 
into account when performing operation reordering or data tiling. Specifically, certain operation 
orderings can have a higher peak bandwidth requirement than others even though the average 
bandwidth is the same, which often happens during the ramp-up or ramp-down of the computation. 
For example, when multiple output activations are generated by parallel PEs in the same 
cycle, it will require a high peak bandwidth in order to store them to memory immediately. Also, 
if techniques such as double buffering are used to prefetch data and hide the data access latency, 
the effective storage capacity of the memory levels becomes smaller under a fixed budget of total 
storage capacity, which should be accounted for when calculating the required reuse distance. 

While the techniques introduced in this section can be applied in many different ways, 
there are a few common approaches in terms of how they are applied in existing designs, which 
form a taxonomy of dataflows. In the next section, we will go through these dataflows in detail 
by formally introducing a loop nest-based representation for dataflows. 
Using the tensor index notation from Chapter 2, the calculation for a CONV layer (with unit 
stride) is: 

$$O_{n,m,p,q} = \left(\sum_{c,r,s} I_{n,c,(p+r),(q+s)} \, F_{m,c,r,s}\right) + b_m. \tag{5.1}$$


That calculation, which can serve as a specification of a desired computation, imposes no ordering 
or notion of parallelism on the individual calculations, but those characteristics are important 
to the performance and efficiency of the computation when run on hardware. Specifying an ordering, 
and which calculations run in parallel, is called a dataflow. 
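
To make the point about ordering concrete, the sketch below evaluates the unit-stride CONV calculation of Eq. (5.1) with a configurable loop order. It is a minimal illustration rather than an efficient implementation: the function name `conv_layer`, the nested-list data layout, and the `loop_order` string are assumptions made for this example. Integer data is used so that results are exactly equal regardless of accumulation order.

```python
import itertools


def conv_layer(I, F, b, loop_order="nmpqcrs"):
    """Unit-stride CONV layer; I[n][c][h][w], F[m][c][r][s], b[m] are nested lists."""
    N, C, H, W = len(I), len(I[0]), len(I[0][0]), len(I[0][0][0])
    M, R, S = len(F), len(F[0][0]), len(F[0][0][0])
    P, Q = H - R + 1, W - S + 1
    bound = dict(n=N, m=M, p=P, q=Q, c=C, r=R, s=S)
    # Start every output from its bias, then accumulate the MACs in any order.
    O = [[[[b[m]] * Q for _ in range(P)] for m in range(M)] for _ in range(N)]
    for idx in itertools.product(*(range(bound[v]) for v in loop_order)):
        v = dict(zip(loop_order, idx))
        n, m, p, q, c, r, s = (v[k] for k in "nmpqcrs")
        O[n][m][p][q] += I[n][c][p + r][q + s] * F[m][c][r][s]
    return O


# Small integer workload: N=1, C=2, H=4, W=5, M=3, R=3, S=2.
I = [[[[(n + 2 * c + h * w) % 5 for w in range(5)] for h in range(4)]
      for c in range(2)] for n in range(1)]
F = [[[[m + c + r - s for s in range(2)] for r in range(3)]
      for c in range(2)] for m in range(3)]
b = [1, 2, 3]

O_default = conv_layer(I, F, b)
O_reordered = conv_layer(I, F, b, loop_order="srcqpmn")
print(O_default == O_reordered)  # the result does not depend on the loop order
```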

A dataflow defines some of the specific rules for controlling activity on an accelerator. 
These include the ordering of the operations, and how to prioritize the use and transfers of data 
temporally and spatially across the memory hierarchy and compute datapaths. It also dictates 
which mappings are legal and directly impacts the performance and energy efficiency of the 
DNN accelerator. Therefore, it is crucial to be able to precisely describe a dataflow. In this 
section, we will formally introduce loop nests, which are a powerful tool for this purpose. 

Loop nests are a compact way to describe various properties of a given DNN accelerator 
design and, in particular, its dataflow. Figure 5.8a shows the loop nest that can represent the operation 
ordering of the example 1-D convolution in Figure 5.2c. The loop with variable s, which 
indexes filter weights, is placed as the outermost loop. That loop traverses the half-open range [0, S), 
where S is the number of filter weights and is defined as part of the shape of the convolution.⁸ 
Another component of the shape of the convolution is Q (the number of partial sums), which is 
traversed over the half-open range [0, Q) by the inner loop using the variable q. The final component 
of the shape is W (the number of input activations), which is traversed via a simple computation 
on variables s and q. More discussion of such computations is included in Section 6.3. 

Since in Figure 5.8a the weights are traversed in the outermost loop, the ordering priori- 
tizes activity to reduce the reuse distance of filter weights. We call the dataflow constructed this 
way a weight-stationary (WS) dataflow. While we are illustrating the idea with a 1-D example, 
it can be generalized to data of higher dimensions: it is a WS dataflow as long as the loops that 
go through different weights are placed above all the other loops. 

Using the space-time diagram in Figure 5.9, we can see the reference activity to each data 
array in each cycle (i.e., execution of the body of the loop nest). In that figure, we can clearly see 
the horizontal sequence of green dots, which represent the same filter weight being reused multiple 
times before processing moves to a new weight. The partial sums are accessed repeatedly after 
a long interval and the input activations are accessed in a large sliding window pattern. In both 
cases, the reuse interval is Q. 

Different dataflows can be created by reordering the loops as convolution imposes no 
ordering constraints. Therefore, we can also create an output-stationary (OS) dataflow, as shown 

⁸By convention in this book, we will typically use a capital letter to represent a constant value that specifies a part of the 
shape of a computation (e.g., M, C, H, W, R, and S for a CONV layer) and the associated small letter (e.g., m, c, h, w, r, s) 
to represent a variable that accesses a data structure along a dimension associated with the corresponding capital letter. 
Figure 5.8: Three un-tiled loop nests for the 1-D convolution workload in Figure 5.2a. In this 
case, W = 6, S = 4, and Q = 4. (a) is a weight-stationary dataflow that generates the ordering 
in Figure 5.2c, while (b) is an output-stationary dataflow that generates the ordering in 
Figure 5.2b. (c) is an input-stationary dataflow. 
Figure 5.9: Space-time diagrams of partial sums (red), input activations (blue), and filter weights 
(green) for the 1-D weight-stationary convolution dataflow from Figure 5.8a with workload 
shape, S = 4, Q = 9, and W = 12. On each plot, each step on the x-axis is a cycle (time) 
and the y-axis represents an offset into the corresponding data array (space). Thus, a point at 
(x, y) = (20, 1) represents an access to element 1 during cycle 20. 


in Figure 5.8b. In this case, the loop that goes through different partial sums is placed as the 
outermost loop. 

Figure 5.10a shows the space-time diagrams for this dataflow. The reference pattern for 
the outputs shows that each partial sum is referenced multiple times in succession and the partial 
Figure 5.10: Space-time diagrams of partial sums (red), input activations (blue), and filter 
weights (green) for the 1-D output-stationary convolution dataflow from Figure 5.8b with workload 
shape S = 4, Q = 9, and W = 12. 


sum is completed before processing moves to the next partial sum. The reference pattern for 
filter weights shows that all the weights are used repeatedly through the run with reuse distance 
S. Finally, the enlarged view of the reference pattern for input activations in Figure 5.10b illustrates a 
sliding window of references of size S moving through the input activations. 
# i[W] - input activations 
# f[S] - filter weights 
# o[Q] - output activations 

for q1 in range(Q1): 
    for s in range(S): 
        for q0 in range(Q0): 
            q = q1 * Q0 + q0 
            w = q + s 
            o[q] += i[w] * f[s] 
Figure 5.11: A tiled loop nest for the weight-stationary dataflow in Figure 5.8a, which 
generates the ordering shown in Figure 5.5 when Q0 = 2 and Q1 = 2. 


It is also possible to create an input-stationary (IS) dataflow; however, we have to use 
w and s as the loop variables and calculate the index variable q, as shown in Figure 5.8c. The 
characteristics and associated architecture designs for each of these dataflows will be discussed 
in Section 5.7. 
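
The three un-tiled dataflows can be written out as runnable Python to confirm that they compute the same result and differ only in loop order. This is a sketch under stated assumptions: the function names are hypothetical, the workload shape (S = 4, Q = 9, W = 12) is borrowed from Figure 5.9, and the bounds check in the input-stationary version is added here so that out-of-range values of q are skipped.

```python
def conv1d_ws(i, f, Q):
    """Weight stationary: the weight index s is the outermost loop."""
    o = [0] * Q
    for s in range(len(f)):
        for q in range(Q):
            o[q] += i[q + s] * f[s]
    return o


def conv1d_os(i, f, Q):
    """Output stationary: the partial-sum index q is the outermost loop."""
    o = [0] * Q
    for q in range(Q):
        for s in range(len(f)):
            o[q] += i[q + s] * f[s]
    return o


def conv1d_is(i, f, Q):
    """Input stationary: loop over w and s, and derive the output index q."""
    o = [0] * Q
    for w in range(len(i)):
        for s in range(len(f)):
            q = w - s
            if 0 <= q < Q:  # assumption: skip combinations outside the output range
                o[q] += i[w] * f[s]
    return o


i = list(range(1, 13))  # W = 12 input activations
f = [1, -2, 3, 4]       # S = 4 filter weights
Q = 9                   # Q = W - S + 1 outputs
print(conv1d_ws(i, f, Q) == conv1d_os(i, f, Q) == conv1d_is(i, f, Q))  # True
```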

The loop nests we have introduced so far are not tiled (i.e., the loops in these loop nests 
go through the entire dimension of the data). Tiling the data of a specific data type involves 
picking a specific dimension of the corresponding data type and breaking it up into multiple 
loops. Figure 5.11 shows an example loop nest of the tiled ordering shown in Figure 5.5. The 
loop that originally goes through dimension Q is now being divided up into two loops with 
new loop bounds Q1 and Q0. The inner bound Q0 defines the size of a tile, while the outer 
bound Q1 defines the number of tiles. The same process can be repeated to break it up into 
more loops, which creates multi-level tiles that can be put into different storage levels in the 
memory hierarchy. Tiling can also be applied to multiple data dimensions simultaneously by 
breaking up loops for different dimensions. 
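
As an illustration of multi-level tiling, the sketch below splits the Q dimension into three loops. The function name `conv1d_two_level_tiled` and the bound names Q2, Q1, and Q0 are hypothetical, and the placement of the s loop between q1 and q0 is just one of several possible choices. The only requirement checked here is that the tiled nest still produces the un-tiled result.

```python
def conv1d_two_level_tiled(i, f, Q2, Q1, Q0):
    """1-D convolution with the Q dimension tiled twice (Q = Q2 * Q1 * Q0)."""
    S = len(f)
    o = [0] * (Q2 * Q1 * Q0)
    for q2 in range(Q2):              # tiles kept at an outer storage level
        for q1 in range(Q1):          # sub-tiles kept at an inner storage level
            for s in range(S):        # each weight is reused across one sub-tile
                for q0 in range(Q0):  # outputs inside one sub-tile
                    q = (q2 * Q1 + q1) * Q0 + q0
                    o[q] += i[q + s] * f[s]
    return o


i = list(range(10))  # W = 10 input activations
f = [1, 2, 3]        # S = 3 filter weights
tiled_result = conv1d_two_level_tiled(i, f, Q2=2, Q1=2, Q0=2)
untiled_result = [sum(i[q + s] * f[s] for s in range(3)) for q in range(8)]
print(tiled_result == untiled_result)  # True
```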

Parallel processing can also be described by the loop nest by introducing parallel-for 
loops in addition to the conventional for loops. The different operations that are iterated 
through in the parallel-for loops will then run on parallel consumers, e.g., PEs. For example, 
we can make a parallel weight-stationary dataflow by making the loop with variable s0 a 
parallel-for in a filter-weight-tiled version of the loop nest of Figure 5.8a. The new loop nest 
is shown in Figure 5.12. 

A space-time diagram of the parallel weight-stationary dataflow is shown in Figure 5.13. 
Here we see that two weights are used repeatedly—one in each PE. Furthermore, both PEs 
access the same partial sum, thus providing an opportunity for a spatial sum (see Section 5.2). 
With respect to the input activations, the two PEs use the same input activation in successive 
cycles providing an opportunity for inter-PE communication or multicast if the PE buffer can 
hold more than one input activation (i.e., the data will arrive at the PE marked by “•”, and wait 
a cycle until it is used). 
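
The behavior described above can be emulated sequentially in Python. The sketch below is an assumption-laden illustration, not the loop nest of Figure 5.12 itself: the function name `parallel_ws` is hypothetical, the parallel-for over s0 is emulated with an ordinary inner loop, and one iteration of the q loop is treated as one cycle. It records which input activation and partial sum each PE touches in each cycle.

```python
def parallel_ws(i, f, Q, S0):
    """Emulate a weight-stationary nest with a parallel-for over a weight tile."""
    S = len(f)
    assert S % S0 == 0, "this sketch assumes S0 divides S"
    S1 = S // S0
    o = [0] * Q
    trace = []  # one entry per cycle: [(pe, w, s, q), ...]
    for s1 in range(S1):
        for q in range(Q):
            cycle, spatial_sum = [], 0
            for s0 in range(S0):      # emulated parallel-for: PE number s0
                s = s1 * S0 + s0
                w = q + s
                spatial_sum += i[w] * f[s]
                cycle.append((s0, w, s, q))
            o[q] += spatial_sum       # all PEs contribute to the same partial sum
            trace.append(cycle)
    return o, trace


i = list(range(1, 13))  # W = 12
f = [1, 2, 3, 4]        # S = 4, tiled as S1 = 2, S0 = 2
o, trace = parallel_ws(i, f, Q=9, S0=2)

# Within a tile, the activation used by PE 1 in one cycle is used by PE 0 in the next.
print(all(trace[t + 1][0][1] == trace[t][1][1] for t in range(8)))  # True
```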
Figure 5.12: A loop nest for the weight-stationary dataflow in Figure 5.8a with parallel processing 
for a tile of filter weights (data dimension S0). 
Figure 5.13: Space-time diagrams of partial sums (red), input activations (blue), and filter 
weights (green) for the 1-D weight-stationary convolution dataflow from Figure 5.12 with two 
parallel PEs. Activity in one PE is marked with a • and the other with an ×. The workload shape is 
S = 4, Q = 9, W = 12, and S is tiled into S1 = 2, S0 = 2. 


Figure 5.13 also illustrates the impact of tiling on reference patterns. Because the S dimension 
is tiled as S1 = 2 and S0 = 2, the weights are divided into two tiles of two weights 
each. In this case, the first tile is used in the first nine (Q) cycles and the second in the second 
nine (Q) cycles. Both tiles also contribute to the same partial sums, so partial sums have a long 
tile-related reuse distance of nine cycles (Q). Finally, most of the input activations are used by 
both tiles, so they are also used with a long reuse distance of Q - 1 cycles. 
Table 5.1: Classification of recent work by dataflow 

Dataflow | Recent Work 
Weight Stationary (Section 5.7.1) | NVDLA [132], TPU [142], neuFlow [143], Sankaradas et al. [144], Park et al. [145], Chakradhar et al. [146], Sriram et al. [147], Origami [148] 
Output Stationary (Section 5.7.2) | DaDianNao [149], DianNao [150], Zhang et al. [151], Moons et al. [152], ShiDianNao [153], Gupta et al. [154], Peeman et al. [155] 
Input Stationary (Section 5.7.3) | SCNN [156] 
Row Stationary (Section 5.7.4) | Eyeriss v1 [157, 139], Eyeriss v2 [158] 

A dataflow only defines the following aspects of a loop nest: (1) the specific order of the 
loops to prioritize the data types; (2) the number of loops for each data dimension to describe 
the tiling; and (3) whether each of the loops is temporal (for) or spatial (parallel-for). The 
maximum number of loops that each data dimension can have is capped by the number of storage 
levels in the hierarchy that the specific data type can utilize. 

The specific loop bounds in the loop nest, e.g., S0 and S1 in Figure 5.12, are not defined 
by the dataflow. However, the maximum value of each loop bound can be limited by a variety 
of factors, including the storage capacity for the temporal loops, the number of reachable 
consumers through the multicast network for the spatial loops (i.e., parallel-for), or the 
size of the data dimension. The specific values of the loop bounds to use for a 
particular workload are determined by the optimization process that finds the optimal mapping, 
as will be discussed in more depth in Chapter 6. 
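
The separation between a dataflow and a mapping can be sketched as a small data structure. Everything named here (`Loop`, `PARALLEL_WS`, `is_legal_mapping`) is hypothetical and exists only for illustration. The check is simplified: it assumes that loop bounds must factor each data dimension exactly and that the only hardware limit is the number of PEs, so storage-capacity limits on temporal loops are omitted.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Loop:
    dim: str       # data dimension the loop traverses, e.g., "S" or "Q"
    spatial: bool  # True for parallel-for, False for a temporal for


# A dataflow fixes loop order, loop count per dimension, and loop kind only.
# This one assumes the order: temporal S, temporal Q, spatial S.
PARALLEL_WS = (Loop("S", False), Loop("Q", False), Loop("S", True))


def is_legal_mapping(dataflow, bounds, shape, num_pes):
    """bounds[k] is the bound chosen for dataflow[k]; shape maps dimension to size."""
    if len(bounds) != len(dataflow):
        return False
    for dim, size in shape.items():
        chosen = [b for loop, b in zip(dataflow, bounds) if loop.dim == dim]
        if prod(chosen) != size:  # bounds must cover the dimension exactly
            return False
    spatial = prod(b for loop, b in zip(dataflow, bounds) if loop.spatial)
    return spatial <= num_pes     # spatial loops are limited by available PEs


shape = {"S": 4, "Q": 9}
print(is_legal_mapping(PARALLEL_WS, (2, 9, 2), shape, num_pes=2))  # True
print(is_legal_mapping(PARALLEL_WS, (1, 9, 4), shape, num_pes=2))  # False
```

The same dataflow admits many mappings; choosing among the legal ones is the job of the optimization process.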


5.7 DATAFLOW TAXONOMY 


In Sections 5.4 and 5.5, we have introduced several techniques to exploit data reuse. While there 
are many ways to apply these techniques, there are several commonly used design patterns that 
can be categorized into a taxonomy of dataflows: Weight Stationary (WS), Output Stationary 
(OS), Input Stationary (IS), and Row Stationary (RS). These dataflows can be seen in many 
recent works of DNN accelerator design, as shown in Table 5.1. In this section, we will use a 
generic architecture to describe how these dataflows are used in the recent works. As shown in 
Figure 5.14, this architecture consists of an array of PEs, with each PE having some local storage 
called a register file (RF), and the array of PEs shares a common storage level called the global 
buffer. 
Figure 5.14: A generic DNN accelerator architecture. 


5.7.1 WEIGHT STATIONARY (WS) 


The weight-stationary dataflow is designed to minimize the energy consumption of reading 
weights by maximizing the reuse of weights from the register file (RF) at each PE (Figure 5.15a). 
The reuse distance of weights is minimized; therefore, each weight is read from DRAM into the 
RF of each PE and stays stationary for further accesses. The processing runs as many MACs that 
use the same weight as possible while the weight is present in the RF; it maximizes convolutional 
and filter reuse of weights. The inputs and partial sums must move through the spatial array and 
global buffer. The input feature map activations are broadcast to all PEs and then the partial 
sums are spatially accumulated across the PE array. 

One example of previous work that implements a weight-stationary dataflow is nn-X (also 
called neuFlow) [146], which uses eight 2-D convolution engines, each of which is capable of 
processing a 2-D filter up to 10x10 in size. There are a total of 100 MAC units (i.e., PEs) per 
engine with each PE having a weight that stays stationary for processing. Figure 5.16 shows one 
2-D convolution engine that can process 3x3 filters as a simplified example. The input feature 
map activations are broadcast to all MAC units and the partial sums are accumulated across the 
MAC units. In order to accumulate the partial sums correctly, additional delay storage elements 
are required, which are counted into the required size of local storage. 
Figure 5.15: The taxonomy of commonly seen dataflows for DNN processing. Act means input 
activation. The color gradient is used to note different values of the same data type. 


Another example architecture that implements the weight-stationary dataflow is Nvidia’s 
Deep Learning Accelerator (NVDLA) [135], as shown in Figure 5.17. While weights stay sta- 
tionary in each PE, the way input activations and partial sums are orchestrated through the PE 
array is different from the nn-X example, and is the same as the example shown in Figure 5.7. 
Note the opportunity for a spatial sum (as described in Section 5.2) vertically along the column 
of PEs. The loop nest representation of a simplified version of NVDLA is shown in Figure 5.18, 
which illustrates the parallelism over the input and output channels and the stationarity of the 
weights. 
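
A sequential emulation can make the weight stationarity of this kind of loop nest visible. The sketch below is not NVDLA's implementation; it only mimics a nest that is parallel over output channels (m) and input channels (c), with the hypothetical name `ws_channel_parallel`. The two parallel-for loops are emulated inside a single "cycle", and the contributions from different input channels are summed spatially before updating the output.

```python
def ws_channel_parallel(i, f, P, Q):
    """i[c][h][w], f[m][c][r][s]; returns (o[m][p][q], number of cycles)."""
    M, C, R, S = len(f), len(f[0]), len(f[0][0]), len(f[0][0][0])
    o = [[[0] * Q for _ in range(P)] for _ in range(M)]
    cycles = 0
    for r in range(R):
        for s in range(S):          # PE (m, c) holds f[m][c][r][s] for P*Q cycles
            for p in range(P):
                for q in range(Q):
                    cycles += 1
                    for m in range(M):  # emulated parallel-for over m
                        # spatial sum over the C PEs sharing output channel m
                        o[m][p][q] += sum(i[c][p + r][q + s] * f[m][c][r][s]
                                          for c in range(C))
    return o, cycles


# Small integer workload: C=3, H=4, W=5, M=2, R=2, S=2, so P=3 and Q=4.
i = [[[(c + 2 * h + w) % 7 for w in range(5)] for h in range(4)] for c in range(3)]
f = [[[[m - c + r * s for s in range(2)] for r in range(2)]
      for c in range(3)] for m in range(2)]
o, cycles = ws_channel_parallel(i, f, P=3, Q=4)
print(cycles)  # 48 = R * S * P * Q, i.e., each weight is reused P * Q = 12 times
```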

Google’s TPU is another design that features a weight-stationary dataflow [145]. A ma- 
jor difference between TPU and NVDLA is that TPU utilizes a systolic array to share input 
activations and accumulate partial sums across the PEs. Other weight-stationary examples are 
found in [147-151]. 
Figure 5.16: WS dataflow as implemented in nn-X or neuFlow [146]. 


5.7.2 OUTPUT STATIONARY (OS) 


The output-stationary dataflow is designed to minimize the energy consumption of reading 
and writing the partial sums (Figure 5.15b). Through minimizing the reuse distance of partial 
sums, it keeps the accumulation of partial sums for the same output activation value local in 
the RF. In order to keep the accumulation of partial sums stationary in the RF, one common 
implementation is to stream the input activations across the PE array and broadcast the weights 
to all PEs in the array from the global buffer. 

Figure 5.19 shows one example that implements an output-stationary dataflow presented 
by ShiDianNao [156], where each PE handles the processing for each output activation value by 
fetching the corresponding input activations from neighboring PEs. The PE array implements 
dedicated NoCs to pass data horizontally and vertically. Each PE also has data delay registers 
to keep data around for the required number of cycles. At the system level, the global buffer 
streams the input activations and broadcasts the weights into the PE array. The partial sums 
are accumulated inside each PE and then get streamed out back to the global buffer. Another 
example can be seen from the work by Moons et al. [155]. As shown in Figure 5.20, each input 
activation is reused across all PEs in the same column, while each weight is reused across all PEs 
in the same row. Other examples of output stationary are found in [157, 158]. 
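
The common output-stationary arrangement can be emulated with a short sketch. It does not model any specific design; it assumes a single channel, one PE per output activation (a P x Q array), and one weight broadcast to all PEs per cycle. The function name `output_stationary_conv2d` is hypothetical.

```python
def output_stationary_conv2d(i, f):
    """i[h][w], f[r][s]; each PE (p, q) keeps its own accumulator locally."""
    H, W, R, S = len(i), len(i[0]), len(f), len(f[0])
    P, Q = H - R + 1, W - S + 1
    acc = [[0] * Q for _ in range(P)]  # one stationary partial sum per PE
    cycles = 0
    for r in range(R):
        for s in range(S):
            weight = f[r][s]           # broadcast to every PE in this cycle
            cycles += 1
            for p in range(P):         # emulated parallel-for over the PE array
                for q in range(Q):
                    acc[p][q] += i[p + r][q + s] * weight
    return acc, cycles


i = [[(3 * h + w) % 5 for w in range(5)] for h in range(4)]
f = [[1, 0, -1], [2, 1, 0]]
acc, cycles = output_stationary_conv2d(i, f)
print(cycles)  # 6 = R * S cycles; partial sums leave the PEs only once, at the end
```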

There are multiple possible variants of output stationary, as shown in Figure 5.21, since 
the output activations that get processed at the same time can come from different dimensions. 
Figure 5.17: Simplified version of a WS dataflow as implemented in NVDLA [135], which 
processes a weight from each of three input channels (c) and eight output channels (m) each 
cycle. The figure shows the first time step of computation. Subsequent time steps will increment 
the values of the indices of the output activations (p and q) and then more slowly the current 
weights (r and s). 


For example, the variant OS_A targets the processing of CONV layers, and therefore focuses 
on the processing of output activations from the same channel at a time in order to maximize 
convolutional data reuse opportunities. The variant OS_C targets the processing of FC layers, and 
focuses on generating output activations from all different channels, since each channel only has 
one output activation. The variant OS_B is something in between OS_A and OS_C. Examples of 
variants OS_A, OS_B, and OS_C are [156], [155], and [158], respectively. 


5.7.3 INPUT STATIONARY (IS) 


Similar to the previous two dataflows, the input-stationary dataflow is designed to minimize the 
energy consumption of reading input activations (Figure 5.15c). With minimized reuse distance, 
each input activation is read from DRAM and put into the RF of each PE and stays stationary 
for further access. Then, it runs through as many MACs as possible in the PE to reuse the same 
input activation. It maximizes the convolutional and input feature map reuse of input activations. 
While each input activation stays stationary in the RF, unique filter weights are uni-cast into the 
# i[C,H,W]   - input activations 
# f[M,C,R,S] - filter weights 
# o[M,P,Q]   - output activations 

parallel-for m in range(M): 
    parallel-for c in range(C): 
        for r in range(R): 
            for s in range(S): 
                for p in range(P): 
                    for q in range(Q): 
                        o[m,p,q] += i[c,p+r,q+s] * f[m,c,r,s] 





Figure 5.18: Loop nest for a simplified variant of the WS dataflow implemented in 
NVDLA [135]. To handle larger numbers of input channels (C) or output channels (M), addi- 
tional temporal for loops would need to be added. Additional parallelism could also be added, 
for example, over the weight indices (r and s). 
Figure 5.19: OS dataflow as implemented in ShiDianNao [156]. 


PEs at each cycle, while the partial sums are spatially accumulated across the PEs to generate 
the final output activation. 

One example that implements the input-stationary dataflow is SCNN [159], where each 
PE, as shown in Figure 5.22, can process four stationary input activations in parallel, with each 
input activation being processed by a SIMD lane of width four. Therefore, each PE has a par- 
Figure 5.21: Variations of output stationary [142]. 


allelism of 16 MACs. SCNN takes advantage of the Cartesian product: any input activation in 
a plane of H x W feature map (i.e., a single input channel) can be reused across R x S x M 
filter weights, and vice versa. Therefore, the PE first fetches 4 input activations out of the input 
feature map of size H x W, goes through 4 out of the R x S x M weights each cycle until all 
weights are looped through, and then switches to the next 4 input activations. Each cycle the 
PE takes 4 input activations and 4 weights, and processes 16 MACs. The partial sums for each 
output activation are spatially accumulated 4 times each cycle, and are put into a partial sum RF 
to be further accumulated across cycles. SCNN also supports processing sparse data, which will 
be discussed in Chapter 8. 
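
The Cartesian-product idea can be illustrated with a dense, single-input-channel sketch. This is not SCNN's implementation: sparsity handling is ignored, the function names are hypothetical, and products whose output coordinate falls outside the output feature map are simply dropped (an assumption corresponding to no padding). Each emulated cycle multiplies a group of stationary input activations with a group of streamed weights.

```python
import itertools


def chunks(items, n):
    """Yield consecutive groups of at most n items."""
    for k in range(0, len(items), n):
        yield items[k:k + n]


def cartesian_product_conv(i, f, lanes=4):
    """i[h][w] is one input channel; f[m][r][s] holds the weights for that channel."""
    H, W = len(i), len(i[0])
    M, R, S = len(f), len(f[0]), len(f[0][0])
    P, Q = H - R + 1, W - S + 1
    o = [[[0] * Q for _ in range(P)] for _ in range(M)]
    acts = [(h, w) for h in range(H) for w in range(W)]
    weights = [(m, r, s) for m in range(M) for r in range(R) for s in range(S)]
    cycles = 0
    for act_group in chunks(acts, lanes):            # stationary input activations
        for weight_group in chunks(weights, lanes):  # weights streamed past them
            cycles += 1
            for (h, w), (m, r, s) in itertools.product(act_group, weight_group):
                p, q = h - r, w - s                  # output coordinate of this product
                if 0 <= p < P and 0 <= q < Q:
                    o[m][p][q] += i[h][w] * f[m][r][s]
    return o, cycles


i = [[(2 * h + w) % 5 for w in range(4)] for h in range(4)]
f = [[[m + r - s for s in range(2)] for r in range(2)] for m in range(2)]
o, cycles = cartesian_product_conv(i, f)
print(cycles)  # 8: four groups of activations times two groups of weights
```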
Figure 5.22: IS dataflow as implemented in SCNN [159]. 


5.7.4 ROW STATIONARY (RS) 


A row-stationary dataflow is proposed in [142], which aims to maximize the reuse and accumulation 
at the RF level for all types of data (weights, input activations, and partial sums) for 
the overall energy efficiency. This differs from weight-stationary, output-stationary, or input-stationary 
dataflows, which optimize only for reducing the energy of accessing weights, partial 
sums, or input activations, respectively. 

The row-stationary dataflow assigns the processing of a 1-D row convolution to each 
PE, as shown in Figure 5.23. It keeps the row of filter weights stationary inside 
the RF of the PE and then streams the input activations into the PE. The PE does the MACs 
for one sliding window at a time, which uses just one memory space for the accumulation of 
partial sums. Since there are overlaps of input activations between different sliding windows, 
the input activations can be kept in the RF and get reused. By going through all the sliding 
windows in the row, it completes the 1-D convolution and maximizes the data reuse and local 
accumulation of data in this row. 

With each PE processing a 1-D convolution, multiple PEs can be aggregated to complete 
the 2-D convolution, as shown in Figure 5.24. For example, to generate the first row of output 
activations with a filter having three rows, three 1-D convolutions are required. Therefore, it 
can use three PEs in a column, each running one of the three 1-D convolutions. The partial 
sums are further accumulated vertically across the three PEs to generate the first output row. To 
generate the second row of output, it uses another column of PEs, where the three rows of input 
activations are shifted down by one row and the same rows of filters are used to perform the three 
1-D convolutions. Additional columns of PEs are added until all rows of the output are completed 
(i.e., the number of PE columns equals the number of output rows). 
Figure 5.23: 1-D convolutional reuse within PE for row-stationary dataflow [142]. 
Figure 5.24: 2-D convolutional reuse within spatial array for row-stationary dataflow [142]. 
This 2-D array of PEs enables other forms of reuse to reduce accesses to the more expensive 
global buffer. For example, each filter row is reused across multiple PEs horizontally, each row 
of input activations is reused across multiple PEs diagonally, and each row of partial sums 
is further accumulated across the PEs vertically. Therefore, 2-D convolutional data reuse and 
accumulation are maximized inside the 2-D PE array. 
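
The row-stationary decomposition of a 2-D convolution into 1-D row convolutions can be sketched as follows. This is an illustrative emulation for a single channel, not the Eyeriss hardware: the function names are hypothetical, PE (r, p) is assumed to hold filter row r and produce a contribution to output row p, and the vertical accumulation is done with a plain element-wise sum.

```python
def pe_row_conv(filter_row, input_row):
    """One PE: 1-D convolution with the filter row held stationary."""
    S = len(filter_row)
    out = []
    for q in range(len(input_row) - S + 1):
        acc = 0  # a single accumulator serves the current sliding window
        for s in range(S):
            acc += input_row[q + s] * filter_row[s]
        out.append(acc)
    return out


def row_stationary_conv2d(i, f):
    """i[h][w], f[r][s]; returns (output rows, input row index used by each PE)."""
    R = len(f)
    P = len(i) - R + 1
    out, used_input_row = [], {}
    for p in range(P):              # PE column p produces output row p
        psum_row = None
        for r in range(R):          # PE row r holds filter row r
            used_input_row[(r, p)] = p + r
            partial = pe_row_conv(f[r], i[p + r])
            # vertical accumulation of partial-sum rows across the PE column
            psum_row = partial if psum_row is None else [
                a + b for a, b in zip(psum_row, partial)]
        out.append(psum_row)
    return out, used_input_row


i = [[(h * w + h) % 6 for w in range(6)] for h in range(5)]
f = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
out, used = row_stationary_conv2d(i, f)
print(used[(0, 1)] == used[(1, 0)])  # True: diagonal PEs share the same input row
```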

To address the high-dimensional convolution of the CONV layer (i.e., multiple feature 
maps, filters, and channels), multiple rows can be mapped onto the same PE, as shown in Fig- 
ure 5.25a. The 2-D convolution is mapped to a set of PEs, and the additional dimensions are 
handled by interleaving or concatenating the additional data. For filter reuse within the PE, dif- 
ferent rows of feature maps are concatenated and run through the same PE as a 1-D convolution 
(Figure 5.25b). For input feature map reuse within the PE, different filter rows are interleaved 
and run through the same PE as a 1-D convolution (Figure 5.25c). Finally, to increase local 
partial sum accumulation within the PE, filter rows and feature map rows from different chan- 
nels are interleaved, and run through the same PE as a 1-D convolution. The partial sums from 
different channels then naturally get accumulated inside the PE (Figure 5.25d). 

The number of filters, channels, and feature maps that can be processed at the same time 
is programmable, and there exists an optimal mapping for the best energy efficiency, which 
depends on the layer shape of the DNN as well as the hardware resources provided, e.g., the 
number of PEs and the size of the memory in the hierarchy. Since all of the variables are known 
Figure 5.25: Multiple rows of different input feature maps, filters, and channels are mapped to 
the same PE within the array for additional reuse in the row-stationary dataflow [142]. 
Figure 5.26: Mapping optimization takes in hardware and DNN shape constraints to determine 
the energy-optimal dataflow [142]. 


before runtime, it is possible to build a compiler (i.e., mapper as described in Chapter 6) to 
perform this optimization off-line to configure the hardware for different mappings of the row- 
stationary dataflow for different DNNs, as shown in Figure 5.26. This is analogous to how 
compilers can optimize the binary for specific CPU or GPU architectures (Section 6.2). In 
Sections 6.4 and 6.5, we will introduce frameworks to perform analysis on energy efficiency and 
performance for optimization. 

One example that implements the row-stationary dataflow is Eyeriss [160]. It consists of 
a 14x12 PE array, a 108KB global buffer, ReLU and feature map compression units, as shown 
in Figure 5.27. The chip communicates with the off-chip DRAM using a 64-bit bidirectional 
data bus to fetch data into the global buffer. The global buffer then streams the data into the PE 
array for processing. 

In order to support the row-stationary dataflow, two problems need to be solved in the 
hardware design. First, how can the fixed-size PE array accommodate different layer shapes? 
Second, although the data will be passed in a very specific pattern, it still changes with different 
shape configurations. How can the fixed design pass data in different patterns? 

Figure 5.27: Eyeriss DNN accelerator [160].

Figure 5.28: Mapping uses replication and folding to maximize utilization of the PE array [160].

Two mapping strategies can be used to solve the first problem, as shown in Figure 5.28.
First, replication can be used to map shapes that do not use up the entire PE array. For example,
in the third to fifth layers of AlexNet, each 2-D convolution only uses a 13x3 PE array, where
3 is their filter height (R) while 13 is the output feature map height (P). This structure is then
replicated four times, and runs different channels and/or filters in each replication. The second
strategy is called folding. For example, the second layer of AlexNet requires a 27x5 PE
array to complete the 2-D convolution. In order to fit it into the 14x12 physical PE array, it
is folded into two parts, 14x5 and 13x5, and each is mapped vertically onto the physical PE
array. Since not all PEs are used by the mapping, the unused PEs can be clock gated to save
energy.
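
The two strategies can be reproduced with a few lines of arithmetic. The sketch below is a simplified model, not the Eyeriss mapper: it assumes that folded segments are placed side by side, that replicas tile the remaining width, and that the function name `map_logical_array` and its return format are hypothetical.

```python
import math


def map_logical_array(P, R, phys_h=14, phys_w=12):
    """Fit a logical P x R PE set onto a phys_h x phys_w physical array.

    P is the output feature map height and R is the filter height.
    Returns (segments, replicas, utilization).
    """
    # Folding: split the P dimension into pieces no taller than the array.
    folds = math.ceil(P / phys_h)
    base, extra = divmod(P, folds)
    segments = [base + 1] * extra + [base] * (folds - extra)

    # Assumption: the folded segments sit next to each other, R columns each.
    width_per_copy = folds * R
    if width_per_copy > phys_w:
        raise ValueError("shape does not fit in one pass in this simple model")

    # Replication: leftover columns run other channels and/or filters.
    replicas = phys_w // width_per_copy
    utilization = replicas * P * R / (phys_h * phys_w)
    return segments, replicas, utilization


for name, P, R in [("AlexNet CONV2", 27, 5), ("AlexNet CONV3-5", 13, 3)]:
    segments, replicas, util = map_logical_array(P, R)
    print(f"{name}: segments={segments}, replicas={replicas}, "
          f"utilization={util:.2f}")
```

Under these assumptions the model yields segments of height 14 and 13 with a single copy for the second layer, and four replicas of the 13x3 structure for the third to fifth layers, matching the examples in the text.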

A custom multicast network is used to solve the second problem related to flexible data
delivery. The simplest way to pass data to multiple destinations is to broadcast the data to all PEs
and let each PE decide if it has to process the data or not. However, this is not very energy efficient,
especially when the PE array is large. Instead, a multicast network is used to send data
to only the places where it is needed. This is achieved by using multicast controllers on the
NoC paths that only pass data when there are destinations that require the data downstream. To
determine which data to pass through each controller, each data item is sent from the global buffer
to the NoC with a tag value. Each multicast controller is also configured with an ID off-line.
The controller then checks if the tag matches its local ID to determine whether to pass the data.
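
The tag-matching scheme can be sketched as follows. This is a minimal behavioral model assuming a single level of controllers with one controller per PE; the class and function names are hypothetical, and the actual hardware organization may differ.

```python
class MulticastController:
    """Passes a value downstream only when its tag equals the configured ID."""

    def __init__(self, controller_id):
        self.controller_id = controller_id  # configured off-line

    def passes(self, tag):
        return tag == self.controller_id


def multicast(tag, controllers):
    """Return the indices of the PEs that receive a value sent with `tag`."""
    return [i for i, ctrl in enumerate(controllers) if ctrl.passes(tag)]


def broadcast(controllers):
    """Baseline: every PE receives the value and must filter it locally."""
    return list(range(len(controllers)))


# Five PEs; PEs 2 and 4 share ID 1, so one transfer with tag 1 reaches both.
controllers = [MulticastController(i) for i in (0, 0, 1, 2, 1)]
receivers = multicast(1, controllers)
```

Changing the IDs changes the delivery pattern without changing the hardware, which is how a fixed design can pass data in different patterns for different layer shapes.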


5.7.5 OTHER DATAFLOWS


The dataflows introduced in the taxonomy are often used as building blocks to create new 
dataflows. Since the stationariness of a dataflow is only relative to a specific level of memory 
hierarchy, different stationariness can be used at each level of the memory hierarchy.⁹ For ex-
ample, the dataflow can work as weight stationary at the RF storage level but output stationary
at the global buffer storage level. From the perspective of a loop nest, it involves tiling the various
data types into multiple levels and then reordering the loops at different levels to prioritize dif-
ferent data types. For example, if there are K levels of memory hierarchy, both dimensions Q and
S in the loop nest of Figure 5.8a can be further divided up into K loops (i.e., for the open ranges
[Q_0 to Q_K) and [S_0 to S_K)), and the loops can then be reordered independently at each loop
level from 0 to K − 1. The same strategy to make reordering decisions can be applied to each storage level;
in other words, the design is fractal. This also implies that smaller DNN accelerator designs can 
be further combined together with a new level of memory to form a larger accelerator. 
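
As an illustration of this tiling, the sketch below splits the Q and S loops of a 1-D convolution into K = 2 levels and lets each level choose its own loop order. It is a simplified stand-in for the loop nest of Figure 5.8a, not a reproduction of it; the function names and the "qs"/"sq" order strings are hypothetical.

```python
def conv1d_reference(weights, inputs):
    """Untiled 1-D convolution used as the golden result."""
    S = len(weights)
    Q = len(inputs) - S + 1
    return [sum(weights[s] * inputs[q + s] for s in range(S))
            for q in range(Q)]


def loop_order(order, q_extent, s_extent):
    """Return (q, s) pairs with the first letter of `order` as the outer loop."""
    if order == "qs":
        return [(q, s) for q in range(q_extent) for s in range(s_extent)]
    if order == "sq":
        return [(q, s) for s in range(s_extent) for q in range(q_extent)]
    raise ValueError(order)


def conv1d_tiled(weights, inputs, q_tiles, s_tiles, orders):
    """Two-level tiled loop nest: Q = Q1 * Q0 and S = S1 * S0.

    orders[0] is the loop order at level 0 (innermost storage level) and
    orders[1] the order at level 1.
    """
    (Q1, Q0), (S1, S0) = q_tiles, s_tiles
    assert S1 * S0 == len(weights)
    assert Q1 * Q0 == len(inputs) - len(weights) + 1
    outputs = [0] * (Q1 * Q0)
    for q1, s1 in loop_order(orders[1], Q1, S1):      # level 1 loops
        for q0, s0 in loop_order(orders[0], Q0, S0):  # level 0 loops
            q, s = q1 * Q0 + q0, s1 * S0 + s0
            outputs[q] += weights[s] * inputs[q + s]
    return outputs


weights, inputs = [1, 2, 3, 4], list(range(9))
# "sq" at level 0 holds a weight while sweeping outputs (weight stationary);
# "qs" at level 1 finishes an output tile before moving on (output stationary).
result = conv1d_tiled(weights, inputs, (3, 2), (2, 2), orders=("sq", "qs"))
```

Every combination of per-level orders produces the same outputs; only the sequence of accesses, and therefore which data type stays put at each storage level, changes.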

In addition to curating one dataflow for the accelerator, there are recent works that ex-
plore supporting multiple dataflows in a single flexible hardware design, including FlexFlow [162],
DNA [163], and MAERI [164]. The key to these designs is to propose flexible NoCs and sup-
port various memory access patterns in order to execute different dataflows that have different
numbers of loop levels and loop ordering in the loop nest. 

While there are endless possibilities for creating new dataflows and optimizing the hard- 
ware architecture for those dataflows, the combination of dataflow(s) and hardware architecture 
that results in the best performance and energy efficiency is still an open research question since 
it heavily depends on the problem size, technology, and amount of hardware resources. There- 
fore, it is crucial to be able to systematically and efficiently analyze different design decisions. 
In Chapter 6, we will introduce several tools that can help with such an analysis. 


9 In a previous version of the taxonomy introduced in [142], there is an additional dataflow called no local reuse (NLR),
which keeps no data stationary locally within the PE. The NLR dataflow, however, can be classified as one of the three 
stationary dataflows (i.e., WS, OS, or IS) at the next storage level, e.g., the global buffer. 


5.7.6 DATAFLOWS FOR CROSS-LAYER PROCESSING


Up until now, we have been looking at dataflows and the associated architectures that focus on 
processing one layer at a time. The processing of multiple layers has to be scheduled sequentially, 
which may involve reconfiguration of the hardware to adapt to the varying shapes and sizes 
of different layers and possibly moving activations to and from DRAM between layers. These
designs were based on the assumption that it has been unlikely for the accelerator to fit an 
entire layer at once for processing; therefore, it does not make much sense to consider processing 
multiple layers at the same time. 

However, as more hardware resources are being devoted to DNN acceleration and the
DNN models are becoming more compute and storage efficient to achieve the same accuracy
for various applications, dataflows and hardware that can process multiple layers at a time can
be beneficial. Specifically, the output activations from one layer are often used directly as the
input activations of the next layer. Keeping the activations in local memory for cross-layer
processing can save additional memory accesses at higher levels, such as DRAM. However,
this often comes at the cost of requiring more storage for weights from different layers, and only
a smaller tile of activations in each layer can be processed at a time.

Several previous works have proposed dataflows and architectures for cross-layer process- 
ing. For example, Fused-layer [165] focuses on saving memory accesses for activations to process 
across multiple convolutional layers in a DNN, while Shortcut Mining [166] targets the data 
reuse from the shortcut connections in residual networks. BRein [167] exploits similar ideas by 
running 13 layers at a time on the same hardware thanks to using binary/ternary weights for the 
DNN to keep storage requirements small. Brainwave [168] takes the idea of processing many 
layers at a time to the extreme by aggregating many FPGAs, each processing one layer at a time, 
to form a pipelined chain of processing fabric. In this case, each FPGA can be configured with 
the dataflow that is optimized for a specific layer. 

Pipelined computation of multiple layers adds many additional considerations. First, a 
layer-granularity pipeline can only be exploited if a series of input feature maps need to be pro- 
cessed (i.e., a batch size greater than one). Otherwise, only a single stage of the pipeline would be 
busy, although tile-level pipelining could still provide some benefit. Second, pipelining layers in- 
troduces requirements for inter-stage storage of activations (since a key motivation was avoiding 
dumping and restoring activations from large expensive storage between layers). This has to be 
managed by careful tiling and/or selection of dataflows, which may vary between layers. For ex- 
ample, Tangram [169] uses tiling to control inter-stage activation storage and BRein [167] saves 
intermediate state for some layer pairs by feeding an output-stationary dataflow into an input- 
stationary dataflow. Finally, the processing and communications in each stage of the pipeline 
should be balanced to keep throughput high, as was explored in Simba [123]. All these consid- 
erations add considerable complexity to the design and mapping optimization process, and we 
are unaware of any comprehensive evaluation or solution to these issues. 

Although dataflows are a key design element of any DNN accelerator, other hardware
elements are important as well. In the next two sections, we review important considerations for
the design of buffers and NoCs for DNN accelerators.


5.8 DNN ACCELERATOR BUFFER MANAGEMENT STRATEGIES


As was described above, data buffering is one of the key components of creating an efficient 
DNN accelerator. So, beyond selecting the appropriate dataflow, which controls where and when
data is buffered, a DNN accelerator needs a buffering scheme that seeks to achieve the
following objectives: 


• efficiently (e.g., in storage requirements) and in a timely fashion transfer exactly the data
that will be needed in the future by the consumer of the data;

• overlap the receipt of data that will be needed in the future with the use of data currently
being consumed;

• remove data exactly when it is no longer needed; and

• do all of the above with precise and cheap synchronization.


Generically, striving to achieve these objectives is called achieving good data orchestra- 
tion. A classification of the current approaches for data orchestration from [170] is illustrated in 
Figure 5.29. In the figure, buffering idioms are split along two axes. At a high level, the
implicit/explicit distinction along one axis refers to the degree to which workload knowledge is lever-
aged to control data buffering decisions, while the coupled/decoupled distinction along the other axis
refers to whether memory requests and responses are round-trip (request-response) or flow-forward
(data is automatically pushed to the consumer).


5.8.1 IMPLICIT VERSUS EXPLICIT ORCHESTRATION 


In the general-purpose computing community, caches are the predominant buffering mecha- 
nism and are based on load/store (i.e., round-trip) operations. Caches have several desirable 
properties, such as composing invisibly into hierarchies. Memory-level parallelism—both mul- 
tiple outstanding fills, as well as concurrency between fills and accesses to current contents—can 
be achieved using well-studied additional hardware (often called lockup-free cache structures).
Caches can be characterized as performing implicit data orchestration as the load request 
initiator does not directly control the cache hierarchy’s decisions about whether the response 
data is retained at any given level of the storage hierarchy, nor when it is removed. Heuristic 
replacement policies are advantageous in general-purpose scenarios because they are workload 
agnostic.¹⁰ On the other hand, for DNN accelerators, the area and energy overheads for features
like tag matches and associative sets are high, and so far as we are aware no contemporary DNN
accelerator incorporates caches.


10 As many programmers care more about optimization than portability, they often reverse engineer the details of the cache
hierarchy and replacement policy to try to explicitly manipulate them.

Figure 5.29: Taxonomy of data orchestration approaches. Communication (lines with arrow-
heads) is assumed to travel on a hardware channel (usually a NoC link). (Figure from [170].)

An alternative to caches is to use scratchpads, which expose an address range of a particular 
staging buffer for loads and stores, thereby enabling explicit and precise control over the data
orchestration. (In Figure 5.29 this is represented by the datapath managing both local and global
requests/responses.) A GPU's shared memory scratchpad [171] is a widespread contemporary
example of this idiom for explicit data orchestration. The size and address range of the scratchpad
are exposed architecturally, and the transfer of data into and out of the scratchpad is managed
via explicit instructions. While scratchpads avoid the hardware overheads of caches, extracting
memory parallelism (both across fills and between overlapping fills and accesses) is tedious and error-
prone,¹¹ and as a result they are difficult to compose into hierarchies.


5.8.2 COUPLED VERSUS DECOUPLED ORCHESTRATION 


Caches and scratchpads both use a load/store paradigm where the initiator of the request also 
receives the response. This is referred to as coupled staging of data, reflected in the left column
of Figure 5.29. With this setup, synchronization between data demand and data availability is
efficient and intuitive: the requester is notified when the corresponding response returns (load-to-
use). The disadvantage of this approach is that it complicates overlapping the fill and access of
data tiles (e.g., via double-buffering) as the single requester/consumer must alternate between
requesting and consuming responses. Additionally, a “landing zone” for the incoming data tile
must be held reserved (and is therefore idle) for the entire round-trip load latency, which in-
creases pressure on memory resources that could otherwise be used for larger tile sizes.¹²

The alternative is to decouple the load request initiator from the response receiver. (In 
Figure 5.29 this is represented by the request/response arrows going to different modules.) In 
this setup, a separate hardware module (e.g., a DMA engine, or address generator (AGEN)) is 
responsible for pushing data into one or more functional units’ buffers.¹³ To tolerate latency, these
are often double-buffered and hence sometimes referred to as ping-pong buffers [172, 173]. The 
main advantage to this approach is that the requester can run at its own rate, and can multicast 
data to multiple simultaneous consumers. Additionally, the feed-forward nature of the pipeline 
means that the tile landing zone only needs to be reserved proportional to the latency between 
adjacent levels of the hierarchy, rather than the entire hierarchy traversal round-trip, allowing 
for increased utilization of equivalent sized memory. Finally, this approach can often transmit
large blocks of data (i.e., bulk transfers), which are more efficient than small requests that must be
dynamically re-coalesced when they access the same memory line.
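
The effect of the landing zone on usable buffer capacity can be estimated with Little's Law (data in flight equals throughput times latency). The sketch below is a back-of-the-envelope model only; the buffer size, bandwidth, and latencies are hypothetical values chosen for illustration, not measurements of any design.

```python
def landing_zone_bytes(fill_bandwidth, latency):
    """Little's Law: bytes in flight = bytes/cycle x cycles of latency."""
    return fill_bandwidth * latency


def usable_tile_bytes(buffer_bytes, fill_bandwidth, latency):
    """Capacity left for the resident tile after reserving the landing zone."""
    reserved = landing_zone_bytes(fill_bandwidth, latency)
    return max(0, buffer_bytes - reserved)


# Hypothetical parameters, for illustration only.
BUFFER_BYTES = 4096
FILL_BANDWIDTH = 8            # bytes per cycle
ROUND_TRIP_LATENCY = 200      # coupled: full hierarchy round trip, in cycles
ADJACENT_LEVEL_LATENCY = 20   # decoupled: latency between adjacent levels

coupled = usable_tile_bytes(BUFFER_BYTES, FILL_BANDWIDTH, ROUND_TRIP_LATENCY)
decoupled = usable_tile_bytes(BUFFER_BYTES, FILL_BANDWIDTH,
                              ADJACENT_LEVEL_LATENCY)
print(f"coupled tile: {coupled} B, decoupled tile: {decoupled} B")
```

With these assumed numbers the coupled organization reserves 1600 bytes and the decoupled one only 160 bytes, so the same physical buffer can hold a noticeably larger tile in the decoupled case.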

This separate producer/consumer approach is similar to Smith’s [174] decoupled access- 
execute (DAE) style of general-purpose computing architecture. In a DAE organization two 
processors are connected by a hardware queue. The access processor is responsible for performing
all address calculations and generating loads—analogous to the DMA engine. Load responses 
are passed to the execute processor—analogous to an accelerator’s functional units and their 
local staging buffers. DAE improves parallelism and reduces the critical paths of instructions 
while allowing both processors to compute at their natural rate. However, classical DAE does 
not explicitly control data orchestration buffers—decisions about staging data are still managed 
by the cache hierarchy, thus Figure 5.29 categorizes DAE as implicit decoupled. 


11 GPU shared memory is paired with high multi-threading and loop unrolling to offset these problems, but this complexity
is almost certainly unacceptable for a more specialized accelerator.

12 The magnitude of this effect can be evaluated using Little’s Law [118].

13 Cache pre-fetching can be considered an example of decoupling. Consideration of this large body of work is beyond the 
scope of this book. 


5.8.3 EXPLICIT DECOUPLED DATA ORCHESTRATION (EDDO)


The most common buffering approach in DNN accelerators is explicit decoupled data orchestra- 
tion (EDDO). Hardware FIFOs [175, 176] are one traditional reusable EDDO staging buffer 
organization. The advantages are that FIFOs cleanly encapsulate synchronization via head and 
tail pointers, and are easily hierarchically composable. However, in practice FIFOs are not flex-
ible enough to meet the needs of DNN accelerators, which often repeat accesses within a win-
dow of a tile (e.g., when performing a convolution). Additionally, for data types such as the
partial sums, staged data must be modified several times in place before being drained. This is 
not possible in single write-port FIFOs without costly re-circulation. 

Explicit decoupled data orchestration (EDDO) schemes have been incorporated as a cus- 
tomized buffering mechanism in some DNN accelerators [142, 152, 159, 177, 178] and other 
specific EDDO buffering schemes, such as DESC [179], have been proposed. However, to il-
lustrate a typical EDDO scheme we will describe buffets [170], which are a generalization of the
data orchestration scheme in Eyeriss [101]. 

At its heart, the operation of a buffet is FIFO-like in that values are filled from an input
NoC link (i.e., a hardware communication channel) into a circular buffer controlled by head and
tail pointers. Values will only be removed from the fill NoC link if the fill occurs. Access to data
in the buffer is provided by a read command; however, unlike a FIFO, which can only be read at
its head, a buffet read is augmented with an address, which is interpreted as an offset from the
head. Keeping a set of values in the buffet and reading them multiple times allows for reuse of 
a tile of data. Analogous to the fill, the read command will only execute if the read value can be 
sent on the read value NoC link (i.e., the NoC link is not blocked). 

Buffets also support updates of values in their buffers. Updates are only allowed at a location
previously read with a read+update command. This allows a buffet to support storing
and updating partial sums.

Finally, a buffet provides a shrink operation that removes a specified number of entries 
from the head of the buffer. The shrink allows one to easily free the space occupied by the tile. 
To avoid costly delays when switching tiles, one can define a tile size that is smaller than the 
size of the buffet. Therefore, fills of the next tile can begin transparently while the previous tile 
is being processed. However, the extra space only needs to be large enough to avoid the startup 
transient prior to work starting on the next tile. That is often much less than the space required 
for double buffering. 

Shrinks need not remove an entire tile. Removing only a portion of a tile (e.g., just one 
value) and then reading sequentially again starting at offset zero allows buffets to support sliding 
windows. 
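
The fill, read, read+update, update, and shrink operations can be captured in a short behavioral model. The sketch below is an assumption-laden software analogy rather than the hardware design from [170]: the class and method names are hypothetical, and conditions under which the hardware would stall are modeled by raising `BufferError`.

```python
class Buffet:
    """Behavioral model of a buffet: a circular buffer addressed from head."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.head = 0          # physical index of the oldest entry
        self.occupancy = 0     # number of valid entries
        self.pending = set()   # slots read with read+update, not yet updated

    def _slot(self, offset):
        return (self.head + offset) % self.capacity

    def fill(self, value):
        if self.occupancy == self.capacity:
            raise BufferError("fill must wait: no room in the buffet")
        self.data[self._slot(self.occupancy)] = value  # write at the tail
        self.occupancy += 1

    def read(self, offset, update=False):
        if offset >= self.occupancy:
            raise BufferError("read must wait: value not yet filled")
        slot = self._slot(offset)
        if slot in self.pending:
            raise BufferError("read must wait: update outstanding")
        if update:
            self.pending.add(slot)  # read+update reserves the location
        return self.data[slot]

    def update(self, offset, value):
        slot = self._slot(offset)
        if slot not in self.pending:
            raise BufferError("update requires a prior read+update")
        self.data[slot] = value
        self.pending.remove(slot)

    def shrink(self, count):
        if count > self.occupancy:
            raise BufferError("cannot shrink past the tail")
        if any(self._slot(i) in self.pending for i in range(count)):
            raise BufferError("shrink must wait: updates outstanding")
        self.head = self._slot(count)
        self.occupancy -= count


def sliding_window_reads(stream, window, capacity=4):
    """Read `window` values from offset zero, shrink by one, refill, repeat."""
    buffet, source, reads = Buffet(capacity), iter(stream), []
    for _ in range(window):
        buffet.fill(next(source))
    while True:
        reads.extend(buffet.read(i) for i in range(window))
        buffet.shrink(1)
        try:
            buffet.fill(next(source))
        except StopIteration:
            return reads
```

Calling `sliding_window_reads(range(5), 3)` reads the windows (0, 1, 2), (1, 2, 3), and (2, 3, 4) in turn, which shows how a shrink of one entry followed by reads from offset zero yields a sliding window.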

Figure 5.30: Buffet Block Diagram. A block diagram of the major components of a buffet. The
principal inputs are a fill value; a read address and read value; an update address, and update value;
and a command, which can specify whether to perform a read, read+update, or shrink. The only
output is a read value. The head, tail, and up-to-date units internally provide synchronization,
stalling operations to preserve proper ordering.

Figure 5.30 shows a block diagram of a buffet. Actions occur when there are values on all
the input NoC links needed by the action (command or fill) and there is room in the output
NoC link (only needed for reads). The activity illustrated in the figure is the following.

• A read command (r) is being invoked that takes read address (1) as an offset from head to
produce the read value (d).

• An update at update address (3) is writing an update value (f’) into the buffet. Note, this is
allowed because an earlier command must have been a read+update at offset 3.

• A fill value (k) is about to be written into the tail of the buffet.


Note that all of the above activity is mediated locally within the buffet by the head, tail,
and up-to-date state, which guarantees proper ordering. For example, a read must wait until its
data has been filled and updated (if there was a prior read+update), and a fill must wait until
there is room in the buffer.

Not illustrated is a shrink command, which simply removes a given number of values from
the head of the buffet by adjusting the head pointer after waiting for outstanding updates.

Figure 5.31 shows a toy example that demonstrates how buffets naturally compose and can 
be used to process sliding windows and updates. The natural composition of the L1 Input Buffet
with the L0 Input Buffet allows filling without external synchronization, because the filling is con-
trolled by internal ordering controls in each buffet. The shrinks by one in the L0 Input Buffet cre-
ate a sliding window of inputs, so a relative sequence of inputs would be 0, 1, 2, 1, 2, 3, 2, 3, 4, ....
Internal synchronization also controls the updates of partial sums.

Figure 5.31: Buffet Example. An artificial example of an Eyeriss-like global buffer and PE
built with buffets. Reads to the L1 Input Buffet fill the L0 Buffet, which performs reads that
pass a sliding window of inputs to the multiplier. The L0 Output Buffet performs a series of
read+update commands to generate the partial sums. The weight buffet is not shown.

In summary, efficient data orchestration is needed in DNN accelerator designs and will
usually be provided by mechanisms, like buffets, that manage data movement. This generally
manifests in the design as explicit control where data is pushed deterministically through the
storage hierarchy, avoiding costly round-trip communication and minimizing “landing zone”
storage requirements. Efficiency is also enhanced by decoupled activity where the hardware pro-
vides local determination of the values needed and local synchronization controls. The
full semantics of a buffet are not needed for every storage buffer, so an optimized implemen-
tation of a subset of those semantics or another custom design that provides comparable benefits
can be employed.


5.9 FLEXIBLE NOC DESIGN FOR DNN ACCELERATORS 


The NoC is an indispensable part of modern DNN accelerators in order to support the data 
delivery patterns of various dataflows, as described in Section 5.7, and its design has to take the 
following factors into consideration: (1) support processing with high parallelism by efficiently 
delivering data between storage and datapaths; (2) exploit data reuse to reduce the bandwidth 
requirement and improve energy efficiency; and (3) scale at a reasonable implementation
cost.

Figure 5.32 shows several NoC designs commonly used in DNN accelerators. Due to the 
property of DNN that data reuse for all data types cannot be maximally exploited simultane- 
ously, a mixture of these NoCs is usually adopted for different data types. For example, a DNN 
accelerator can use a 1-D horizontal multicast network to reuse the same weight across PEs in 
the same row and a 1-D vertical multicast network to reuse the same input activation across 
PEs in the same column. This setup will then require a unicast network that gathers the unique
output activations from each PE. This combination, however, implies that each weight needs to
have an amount of reuse with different input activations at least equal to the width of the PE
array, and each input activation an amount of reuse with different weights at least equal to the height
of the PE array. If these conditions are not fulfilled, the PE array will not be fully utilized, which
will impact both throughput and energy efficiency. 
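
These two conditions can be turned into a rough utilization estimate. The sketch below is a first-order model under stated assumptions: a weight multicast along a row can only occupy as many PEs as it has reuse, and likewise for an input activation multicast along a column. The function name and the example shapes are hypothetical.

```python
def array_utilization(weight_reuse, iact_reuse, width, height):
    """Fraction of PEs kept busy by 1-D row and column multicast networks.

    weight_reuse: number of different input activations each weight is used with.
    iact_reuse:   number of different weights each input activation is used with.
    """
    active_columns = min(weight_reuse, width)   # PEs sharing one weight in a row
    active_rows = min(iact_reuse, height)       # PEs sharing one iact in a column
    return (active_columns * active_rows) / (width * height)


# Ample reuse in both data types fills a 16x16 array.
full = array_utilization(weight_reuse=64, iact_reuse=64, width=16, height=16)

# A layer with no input activation reuse can use only one row of the array.
starved = array_utilization(weight_reuse=64, iact_reuse=1, width=16, height=16)
```

In this model the second case uses only 1/16 of the array, which illustrates why a fixed combination of multicast networks is sensitive to layer shape.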

Figure 5.33 shows two designs with the example NoCs that are commonly used in many 
existing DNN accelerators [135, 145, 156, 157, 180-182]. A spatial accumulation array archi- 
tecture (Figure 5.33a), which is often used for a weight-stationary dataflow, relies on both output 
and input channels to map the operations spatially onto the PE array to exploit parallelism. At 
the same time, each input activation can be reused across the PE array vertically with weights 
from different output channels, while partial sums from the PEs in the same row can be further
accumulated spatially together before being written back to the global buffer. Similarly, a temporal
accumulation array architecture (Figure 5.33b), which is often used for an output-stationary 
dataflow, relies on another set of data dimensions to achieve high compute parallelism. In this 
case, each input activation is still reused vertically across different PEs in the same column, while 
each weight is reused horizontally across PEs in the same row. 

Figure 5.32: Common NoC designs.

Figure 5.33: Two common DNN accelerator designs: (a) spatial accumulation array [135, 145,
156, 157]: input activations (iacts) are reused vertically and partial sums (psums) are accumulated
horizontally; and (b) temporal accumulation array [180-182]: input activations (iacts) are reused
vertically and weights are reused horizontally.


When the set of pre-selected data dimensions diminishes due to the change in DNN shapes
and sizes, e.g., the number of output channels in a layer (M) is less than the height of the PE
array, efficiency decreases. Specifically, these spatial mapping constraints result in both reduced
array utilization (i.e., fewer PEs are used) as well as lower energy efficiency. Furthermore, these
inefficiencies are magnified as the size of the PE array is scaled up, because the diminished di-
mension is even more likely to be unable to fill the array. For example, as shown in Figure 5.34,
the aforementioned spatial and temporal accumulation arrays will find it difficult to fully uti-
lize the array due to the lack of input and output channels when performing depth-wise (DW)
convolutions in MobileNet [183] (see Section 9.1.1 and Figure 9.6 for more on depth-wise
convolution). In contrast, Eyeriss [101] can still achieve high array utilization under such cir-
cumstances by mapping the independent channel groups onto different parts of the PE array due
to the flexibility of its row-stationary dataflow.

Figure 5.34: Array utilization of different architectures for depth-wise (DW) convolutions in
MobileNet. The colored blocks are the utilized part of the PE array. For Eyeriss [101], the
different colors denote the parts that run different channel groups (G). Please refer to Table 2.1
for the meaning of the variables.

Figure 5.35: The pros and cons of different NoC implementations.

The varying amount of data reuse for each DNN data type across different layers or models
poses a great challenge to the NoC design. As shown in Figure 5.35, the broadcast network can
exploit the most data reuse, but its low source bandwidth can limit the throughput when data
reuse is low. The unicast network can provide the most source bandwidth but misses out on the
data reuse opportunity when available. Taking the best from both worlds, an all-to-all network
that connects any data sources to any destinations can adapt to the varying amount of data reuse
and bandwidth requirements. However, the cost of its design increases quadratically with the
number of nodes, e.g., PEs, and therefore is difficult to scale up to the amount of parallelism
required for DNN accelerators.

Figure 5.36: (a) High-level structure of the hierarchical mesh network (HM-NoC), and its dif-
ferent operating modes; (b) high bandwidth mode; (c) high reuse mode; (d) grouped-multicast
mode; and (e) interleaved-multicast mode. In each mode, the colored arrows show the routing
path; different colors denote the path for unique data.


5.9.1 FLEXIBLE HIERARCHICAL MESH NETWORK 


To deal with this problem, Eyeriss v2 [161] proposed a new NoC architecture for DNN acceler- 
ators, called hierarchical mesh network (HM-NoC), as shown in Figure 5.36a. HM-NoC takes 
advantage of the all-to-all network, but solves the scaling problem by creating a two-level hier-
archy. The all-to-all network is limited within the scope of a cluster at the lower level. There are
usually only dozens of PEs within each cluster, which effectively reduces the cost of the all-to-all
network. At the top level, the clusters are further connected with a mesh network. While this 
example shows a 2x1 mesh, an actual design can have a much larger mesh size. Scaling up the 
architecture at the cluster level with the mesh network is much easier than with the all-to-all 
network since the implementation cost increases linearly instead of quadratically. 
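
The difference between quadratic and linear scaling can be made concrete with a toy cost model. The sketch below counts abstract cost units under assumed conventions: a flat all-to-all network costs one unit per source-destination pair, and each cluster adds a fixed, hypothetical mesh router cost. None of these constants come from [161].

```python
def all_to_all_cost(num_nodes):
    """Flat all-to-all network: one cost unit per source-destination pair."""
    return num_nodes ** 2


def hierarchical_mesh_cost(num_clusters, pes_per_cluster, mesh_router_cost=4):
    """Two-level hierarchy: all-to-all inside each cluster, mesh between them.

    mesh_router_cost is an assumed fixed cost per cluster for its mesh router.
    """
    per_cluster = all_to_all_cost(pes_per_cluster) + mesh_router_cost
    return num_clusters * per_cluster


for clusters in (16, 32, 64):
    pes = clusters * 16  # 16 PEs per cluster
    print(f"{pes:5d} PEs: flat={all_to_all_cost(pes):8d} "
          f"hierarchical={hierarchical_mesh_cost(clusters, 16):6d}")
```

With the cluster size held fixed, doubling the number of PEs doubles the hierarchical cost but quadruples the flat all-to-all cost.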

Figure 5.37 shows several example use cases of how HM-NoC adopts different modes for
different types of layers. For simplicity, we show only a case with 2 PE clusters
with 2 PEs in each cluster, and omit the NoC for partial sums. However, the same principles
apply to NoC for all data types and at larger scales. 


Figure 5.37: Examples of weight and input activation hierarchical mesh networks configured
in different modes for different types of DNN layers: (a) CONV layers; (b) depth-wise (DW)
CONV layers; and (c) FC layers. Green arrows and blue arrows show the routing paths in the
weight and input activation NoC, respectively.

• Conventional CONV layers (Figure 5.37a): in normal CONV layers, there is plenty of
data reuse for both input activations and weights. To keep all four PEs busy at the lowest
bandwidth requirement, we need two input activations and two weights from the data
source (ignoring the reuse from RF). In this case, either the HM-NoC for input activations
or the one for weights has to be configured into the grouped-multicast mode, while the other
one is configured into the interleaved-multicast mode.

• Depth-wise (DW) CONV layers (Figure 5.37b): for DW CONV layers, there can be
nearly no reuse for input activations due to the lack of output channels. Therefore, we can
only exploit the reuse of weights by broadcasting the weights to all PEs while fetching
a unique input activation for each PE.

• FC layers (Figure 5.37c): contrary to the DW CONV layers, FC layers usually see little
reuse for weights, especially when the batch size is limited. In this case, the modes of
input activation and weight NoCs are swapped from the previous one: the weights are
now unicast to the PEs while the input activations are broadcast to all PEs.
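
The three cases above can be sketched as a configuration table for the simplified 2-cluster, 2-PE example. This is a hedged illustration: the mode names follow Figure 5.36, but the exact meaning assigned to grouped and interleaved multicast here (one value per cluster versus one value per PE position across clusters) is an assumption, as are all function names. For CONV layers the text allows either assignment of the two multicast modes; the sketch arbitrarily gives grouped multicast to the weights.

```python
def route(mode, clusters=2, pes_per_cluster=2):
    """Return {(cluster, pe): source_index} for one NoC in the given mode."""
    table = {}
    for c in range(clusters):
        for p in range(pes_per_cluster):
            if mode == "unicast":                  # high bandwidth mode
                src = c * pes_per_cluster + p
            elif mode == "broadcast":              # high reuse mode
                src = 0
            elif mode == "grouped_multicast":      # one value per cluster
                src = c
            elif mode == "interleaved_multicast":  # one value per PE position
                src = p
            else:
                raise ValueError(mode)
            table[(c, p)] = src
    return table


# (weight NoC mode, input activation NoC mode) for each layer type.
LAYER_MODES = {
    "CONV": ("grouped_multicast", "interleaved_multicast"),
    "DW_CONV": ("broadcast", "unicast"),
    "FC": ("unicast", "broadcast"),
}


def configure(layer_type):
    """Return {(cluster, pe): (weight_source, iact_source)} for a layer type."""
    weight_mode, iact_mode = LAYER_MODES[layer_type]
    weights, iacts = route(weight_mode), route(iact_mode)
    return {pe: (weights[pe], iacts[pe]) for pe in weights}


def source_values_needed(layer_type):
    """Number of unique (weights, input activations) fetched from the source."""
    config = configure(layer_type)
    return (len({w for w, _ in config.values()}),
            len({a for _, a in config.values()}))
```

In this model a CONV layer needs two weights and two input activations from the source while still giving each of the four PEs a distinct weight and activation pair, whereas the DW CONV and FC configurations need four unique values of one data type and a single value of the other.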


While conventional mesh NoC implementations often come with high area and power 
overhead due to the need to route the data dynamically, it is not the case for HM-NoC since 
it does not require routing at runtime. Instead, all active routes are determined at configuration 
time based on the specific data delivery pattern in use. As a result, no flow control is required, 
and the routers are simply multiplexers for circuit-switched routing that has minimum imple- 
mentation cost. 
Figure 5.38: A DNN accelerator architecture built based on the hierarchical mesh network.


Figure 5.38 shows an example DNN accelerator built based on the hierarchical mesh 
network. The router clusters are now connected in a 2-D mesh. The global buffer (GLB) is 
banked and distributed into each source cluster, and the PEs are grouped into the destination 
clusters instead of one single array. 


5.10 SUMMARY 


In this chapter, we presented the motivation, principal objectives and key design alternatives for 
specialized hardware for DNN accelerators. The tremendous degrees of freedom in the ordering
of the MAC operations in DNN computations, the need for efficient data movement, and the
desire to flexibly handle many DNN workload shapes lead to a large design space. This chapter
concentrated on understanding that design space from the perspective of performance and en- 
ergy efficiency with a particular emphasis on reuse and how to exploit it spatially and temporally 
in a multi-level storage hierarchy. Central to that topic was the notion of dataflows, how they 
manifest in various existing designs and how they can be expressed precisely with the use of 
loop nests. Also presented were ideas related to efficient buffering using explicitly decoupled data 
orchestration (EDDO) and flexible NoC design. Left to be explored in the next chapter is the
process of finding an optimal mapping of specific workload shapes onto a given DNN accelerator.
CHAPTER 6


Operation Mapping on 
Specialized Hardware 


In Chapter 5, we discussed various key design considerations and techniques for the implemen- 
tation of specialized DNN hardware. Also introduced was the notion of the mapping of the 
computation for a particular workload layer shape onto a specific DNN accelerator design, and 
the fact that the compiler-like process of picking the right mapping is important to optimize 
behavior with respect to energy efficiency and/or performance.¹

Mapping involves the placement and scheduling in space and time of every operation
(including delivering the appropriate operands) required for a DNN computation onto the
hardware function units of the accelerator.² Mapping has the following key steps.


1. Determine the dataflow: If the targeted DNN accelerator supports more than one dataflow, 
then the mapping must select a dataflow. In the loop nest representation of a design in- 
troduced in Chapter 5, a dataflow is manifest in the number and order of for loops. For 
multiple levels of storage, the loops can have an independently selectable order at each 
level. The choice among all these myriad orders can have a major impact on the behavior 
of an accelerator, and so is an important component of mapping. However, just choosing 
a dataflow is not sufficient, because it does not define the loop bounds. 


2. Determine the data tile sizes: After selecting the dataflow, we need to determine the tile of 
data for each data type that each instance of storage at each level works on within a certain 
duration of time. Thus, this tiling is in both space and time and can have a major impact
on the amount of data that needs to be moved. For example, with the weight-stationary 
(WS) dataflow at storage level L1, the sizes of the input activation tile and partial sum tile 
determine how many times the weight tile will be reused in L1, and the storage capacity 
of L1 constrains the sizes of these three data tiles. From the point of view of the loop nest 
representation of a design, this step determines the for loop bounds. 


3. Bind the operations to the hardware: Given the loop nest order and fixed loop bounds, the
final step is to determine where the operations (arithmetic computation or data access) at
each iteration of the loop are bound in the hardware (either PE or storage buffer). For
example, one possibility is to simply define that the arithmetic operation at iteration i of the
parallel-for loop goes to the PE with ID i. Without loss of generality, we will assume
this simple binding is employed in the rest of this chapter. However, more complicated
binding schemes can be beneficial in certain cases, for instance, when the range of the
parallel-for loop is not equal to the number of PEs, or when some physical attribute
of the hardware can be taken into account (e.g., the physical proximity of specific PEs or
buffers to one another).


¹ For DNN accelerators capable of processing multiple layers at once, the notion of mapping must be expanded to consider
optimizing more globally. Such considerations are beyond the scope of this chapter.

² Some papers refer to this activity just as scheduling, but we use the word mapping to emphasize its relevance to both
space and time.
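The three mapping steps can be pictured as three fields of a single record. The following is a minimal sketch under stated assumptions: the Mapping class, its field names, and the loop bounds are hypothetical, and folded_binding is just one invented example of a "more complicated" binding, not a scheme taken from any particular accelerator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def identity_binding(i: int, num_pes: int) -> int:
    """The simple binding: parallel-for iteration i runs on the PE with ID i."""
    if i >= num_pes:
        raise ValueError("parallel-for range exceeds the number of PEs")
    return i


def folded_binding(i: int, num_pes: int) -> int:
    """A hypothetical alternative: wrap extra iterations around onto the PEs."""
    return i % num_pes


@dataclass
class Mapping:
    # Step 1: dataflow, as the loop order at each storage level (outermost first).
    loop_order: Dict[str, List[str]]
    # Step 2: tile sizes, as the bound of each loop.
    loop_bounds: Dict[str, int]
    # Step 3: binding of parallel-for iterations to PEs.
    binding: Callable[[int, int], int] = identity_binding


# A weight-stationary example: at the PE level, s is the outer loop.
ws_mapping = Mapping(
    loop_order={"L2": ["m2"], "L1": ["m1"], "L0": ["s", "q"]},
    loop_bounds={"m2": 4, "m1": 2, "s": 3, "q": 8},
)

# With two PEs, the simple binding sends iterations 0 and 1 to PEs 0 and 1.
print([ws_mapping.binding(i, 2) for i in range(ws_mapping.loop_bounds["m1"])])
# A parallel-for range of 4 on two PEs needs something like the folded binding.
print([folded_binding(i, 2) for i in range(4)])
```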


The set of all possible mappings for a specific workload shape onto a specific DNN
accelerator is called the map space for that workload. Given the number of degrees of freedom in
mapping, map spaces can be very large. Furthermore, the fact that many attributes of an
accelerator’s design are fixed, such as the number of PEs, the number and sizes of buffers, and the
connectivity of the networks-on-chip (NoCs), leads to complex constraints on which mappings
are legal. Therefore, map spaces tend to be irregular (i.e., there are many illegal points in the
unconstrained map space).

The large size and irregularity of map spaces, together with the fact that different mappings can
result in drastically different performance and/or energy efficiency on the hardware, lead to a
strong desire to be able to find the optimal mapping in the map space. The tool that is used to
find such a mapping is called a mapper. To do its job, a mapper needs to be able both to express
and search the map space and to evaluate and quantify the energy efficiency and performance of
different mappings. In this chapter, we will present the mechanics of performing such searches, 
including some analysis frameworks that can be used to systematically characterize a mapping. 
But first, we will explore a bit more deeply what constitutes a mapping. 


6.1 MAPPING AND LOOP NESTS 


Mapping a workload shape to a DNN accelerator involves a number of choices including: pick- 
ing a dataflow from among those supported by the accelerator and picking both the spatial and 
temporal tiling of all the operands to the storage and computational components of the
accelerator. For both CONV and FC³ layers, these factors can be represented in terms of the loop
nests described in Chapter 5. Therefore, one can view the process of creating a mapping from 
the perspective of selecting and parameterizing loop nests. 

To provide an instructive example of mapping onto a DNN accelerator design, consider
the simple DNN accelerator design depicted in Figure 6.1. The architecture is assumed to have
two PEs, each of which handles processing of a single-input-channel, 1-D convolution for
multiple output channels. Each PE’s buffers can hold a copy of the input feature map for the one
and only input channel, as well as the output feature map and filter weights for the output
channel they are currently working on. Since the PEs can only hold the weights for a single output
channel at a time, processing a new output channel requires that its weights be loaded from the
global buffer and that the output feature map from the prior output channel be sent to the global
buffer. Finally, we assume the PE can be configured to run either an output-stationary or a
weight-stationary dataflow.


³ Recall from Section 4.1 that the matrix multiply of an FC layer can be represented as a convolution with R = H and
S = W.


Figure 6.1: Mapping target architecture for single input channel 1-D convolution with multiple
output channels. This simple design has a global buffer and two PEs each with buffers for a single
channel’s worth of input and output feature maps and filter weights. Each PE is configured with
the input feature map preloaded and processes each output channel by loading the filter weights
for that channel from the global buffer, computing the output feature map for that channel and
sending that output feature map back to the global buffer.

A loop nest corresponding to that design is shown in Design 6.1 below. The design
performs a single-input-channel, multiple-output-channel 1-D convolution using an input feature
map (i[]) and filter weights (f[]), generating an output feature map with multiple output
channels (o[]). Those variables are defined (along with their sizes) in lines 1-3. Skipping lines 7-9
for the moment, the outermost for loop (line 13) represents a tiling of the M output channels,
and each iteration of that loop results in a tile-sized chunk of weights being read from the global
buffer. Those tiles are fed to the parallel PE units that are represented by the parallel-for in
line 17, where m1 corresponds to the ID of the PE doing the processing. The index of the active
output channel is calculated in line 18 from the output channel tile number (m2) and the ID
of the PE doing the processing (m1). Finally, the actual convolutions are performed in a PE
processing unit using either a weight-stationary dataflow (lines 23-26) or an output-stationary
dataflow (lines 28-31).

In the loop nest for Design 6.1, there are two sets of variables. First
are the variables W, M, S, and Q used in lines 1-3 that define the shape of the workload. Second are the
variables M2, M1, and PE_dataflow that control the behavior of the loop nest. For this design,
those variables constitute the mapping of the workload onto the design, because they control 
the placement and scheduling of activity in both space and time. In the code, those variables 
are assigned from the associative array with the unsurprising name mapping. The contents of 
mapping are assumed to have been set by a mapper that picked them based on some optimization 
criteria. 

Design 6.1 Example Mapping Target - parallel output channels

     1  i = Array(W)       # Input feature map
     2  f = Array(M, S)    # Filter weights
     3  o = Array(M, Q)    # Output feature map

     5  # Mapping

     7  M2 = mapping["M2"]
     8  M1 = mapping["M1"]
     9  PE_dataflow = mapping["PE_dataflow"]

    11  # Level 2 - Global Buffer

    13  for m2 in [0, M2):

    15      # Level 1 - PE array

    17      parallel-for m1 in [0, M1):
    18          m = m2 * M1 + m1

    20          # Level 0 - PE

    22          if PE_dataflow == "sq":
    23              for s in [0, S):
    24                  for q in [0, Q):
    25                      w = q + s
    26                      o[m, q] += i[w] * f[m, s]
    27          elif PE_dataflow == "qs":
    28              for q in [0, Q):
    29                  for s in [0, S):
    30                      w = q + s
    31                      o[m, q] += i[w] * f[m, s]


In this design, there are only two components of the mapping. The first component of
the mapping is a selection of one of the two dataflows supported. This is controlled by the
PE_dataflow mapping variable, which specifies the iteration order of the index variables (“s”
and “q”) used in the for loops. Thus, PE_dataflow essentially sets a permutation order for
the for loops representing the PE’s dataflow as either weights-outputs (“sq”) or outputs-weights
(“qs”), which correspond to weight-stationary and output-stationary dataflows, respectively.
The second component is the spatio-temporal tiling of output channel-related data
controlled by the mapping variables M2 and M1. The product of these variables must be at least equal
to the number of output channels, so M <= M2 x M1. In addition, the spatial tiling size (M1) is 
limited by the actual number of PEs in the design, so M1 <= 2. 

The conditions that any mapping must satisfy are referred to as mapping constraints,
and the objective of the mapper is to optimize system behavior within those constraints. In this 
simple example, it is clear that to optimize throughput one would like to set M1 = 2 to achieve 
full utilization of the PEs, but in more complex scenarios one cannot always achieve perfect 
utilization. Note that even with M1 = 2 an odd number of output channels (M) would result in 
under-utilization of the PEs in the last iteration of the m2 loop (line 13). 
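To make the constraints and the utilization remark concrete, here is a runnable sketch of the Design 6.1 loop nest. It is a sequential emulation, not the authors' code: the parallel-for is executed serially, Q is assumed to equal W - S + 1 (unit stride, no padding), and the guard that skips channel indices m >= M is an assumption added so that an odd M does not index past the end of the arrays. The helper pe_utilization is hypothetical.

```python
import math


def design_6_1(i, f, mapping, num_pes=2):
    """Emulate the Design 6.1 loop nest sequentially and return the outputs."""
    M, S = len(f), len(f[0])
    Q = len(i) - S + 1                  # assumed: unit stride, no padding
    M2, M1 = mapping["M2"], mapping["M1"]
    pe_dataflow = mapping["PE_dataflow"]

    # Mapping constraints stated in the text.
    if M > M2 * M1:
        raise ValueError("need M <= M2 x M1")
    if M1 > num_pes:
        raise ValueError("need M1 <= number of PEs")

    o = [[0] * Q for _ in range(M)]
    for m2 in range(M2):                # Level 2: global buffer
        for m1 in range(M1):            # Level 1: PE array (parallel-for, run serially)
            m = m2 * M1 + m1
            if m >= M:                  # assumed guard: this PE idles
                continue
            if pe_dataflow == "sq":     # weight-stationary
                for s in range(S):
                    for q in range(Q):
                        o[m][q] += i[q + s] * f[m][s]
            elif pe_dataflow == "qs":   # output-stationary
                for q in range(Q):
                    for s in range(S):
                        o[m][q] += i[q + s] * f[m][s]
            else:
                raise ValueError("unknown PE_dataflow")
    return o


def pe_utilization(M, M1, num_pes=2):
    """Fraction of PE slots doing useful work across all m2 iterations."""
    M2 = math.ceil(M / M1)
    return M / (M2 * num_pes)


i = [1, 2, 3, 4, 5]                     # W = 5
f = [[1, 0], [0, 1], [1, 1]]            # M = 3 output channels, S = 2
ws = design_6_1(i, f, {"M2": 2, "M1": 2, "PE_dataflow": "sq"})
os_ = design_6_1(i, f, {"M2": 2, "M1": 2, "PE_dataflow": "qs"})
print(ws == os_, ws)
print(pe_utilization(M=3, M1=2), pe_utilization(M=4, M1=2))
```

Both dataflows produce the same outputs; with M = 3 and two PEs, one PE idles in the last m2 iteration, so only three of the four PE slots are used.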

The above example was very simple as there were few index variables at each level of the 
storage hierarchy (and none for additional dimensions of the input feature maps or filter weights, 
batch size or input channels) and only the output channels were tiled. Therefore, the mapping 
choices were both limited and generally obvious. In more realistic designs, many (and sometimes 
all) of the index variables will appear at each level of the hierarchy and a large number of loop 
order permutations might be available. This makes the map space much larger.
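A rough count shows how quickly loop order choices alone can grow. The sketch below assumes, purely for illustration, that every index variable appears exactly once at every storage level and that every permutation is allowed; real designs restrict this, and tiling choices multiply the count further. The function name and the example sizes are hypothetical.

```python
import math


def loop_order_count(num_index_vars, num_levels):
    """Count loop-order choices if every level may permute all index variables."""
    return math.factorial(num_index_vars) ** num_levels


# Design 6.1: only the two-loop PE level (s, q) has a selectable order.
print(loop_order_count(2, 1))
# Hypothetical larger design: 7 index variables at each of 3 storage levels,
# giving (7!)**3 = 5040**3 loop-order combinations.
print(loop_order_count(7, 3))
```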

Another characteristic of more realistic designs would be the number and complexity of 
the constraints on mappings. Sources of such constraints include buffer capacity limits at each 
storage unit (either partitioned per operand type or allocated from a shared pool) and incom- 
plete NoC connectivity between units (including options for bypassing of storage levels). This 
results in a large and irregular mapping space which makes for a correspondingly more complex 
search. This process, however, has some similarities to the compilation process for conventional 
processors, which will be discussed next. 


6.2 MAPPERS AND COMPILERS 


The determination of a good mapping for a DNN accelerator can be viewed as being analo- 
gous to that of compiling for a general-purpose processor, as illustrated in Figure 6.2 [184]. 
In conventional computer systems, the compiler translates the program into machine-readable 
binary codes for execution; in the processing of DNNs, the mapper translates the desired DNN 
layer computation (i.e., problem specification) along with its shape and size⁴ into a
hardware-compatible mapping for execution. While the compiler usually optimizes just for performance,
the mapper will typically optimize for performance and/or energy efficiency. 

As described in Chapter 5, the dataflow(s) that a DNN accelerator supports is a key
attribute of the design. Therefore, the supported dataflow(s) can be thought of as analogous to one
of the most salient attributes of the architecture of a general-purpose processor in the sense that
it prescribes what constitutes a properly formed program for the system. Similar to the role of
an instruction set architecture (ISA) or memory consistency model, dataflow characterizes the
hardware implementation and defines many of the rules that the mapper has to follow in
order to generate hardware-compatible mappings.


⁴ While this chapter largely considers CONV layer computations so the problem specification consists of just a shape
and size, it is natural to imagine this specification being extended to something more general, like an expression in the tensor
index notation described in Chapter 2.


Figure 6.2: An analogy between (a) the compilation process for general-purpose processors and
(b) the mapping process for DNN accelerators. (Figure adapted from [184].)

In many cases, a key characteristic of an architecture is that it remains stable (or
evolves only in an “upward compatible” fashion) across implementation generations. In this way, we
also consider the dataflow as analogous to the architecture, since we believe that within a DNN 
accelerator family the set of available dataflows is going to largely remain invariant across im- 
plementations. However, like GPUs, the stability of some aspects of the architecture, including 
supported dataflow(s), could diminish for DNN accelerators due to their rapid evolution and 
increased reliance on the compiler/mapper to mask differences between designs. 
In addition to the dataflows, other constraints on the mapping space would be included
in the characteristics analogous to architecture. These reflect features such as buffer sizes and
NoC connectivity that the mapper must consider if the resulting mapping is going to function
correctly. The fact that attributes like buffer sizes are included as part of the architecture for the
accelerator makes it quite different from a processor’s architecture, because storage sizes in a
processor (like cache sizes) are generally not included in the architecture.⁵ Thus, DNN accelerator
buffer sizes are more akin to the size of a processor’s register file, which manifests as an inviolable
constraint for creating legal binaries.

Detailed information about the hardware implementation, including latency, throughput 
and energy cost for storage accesses and NoC traffic at each level of the storage hierarchy, is 
analogous to the micro-architecture of processors for the following reasons: (1) these details can vary
considerably across implementations; and (2) although they play a vital part in performance and
energy efficiency optimization, considering them is not essential, since even if
they are ignored the mapper will still generate functional, albeit sub-optimal, mappings.

The final input into either the compiler or mapper is behavioral statistics from the
hardware. By using such statistics, the compiler/mapper can better decide on good optimizations via
iterative refinement. However, in many cases this information is not available and the compil- 
er/mapper must make its own projections of the impact of a choice it makes. For DNN acceler- 
ators, this involves modeling the design to project metrics of interest for a given mapping. An 
example of how to generate such projections will be described in Section 6.4. 

Given the above information, the output of the compiler is an optimized binary program 
that can run on the processor. By analogy, the goal of the mapper is to search in the mapping 
space for the mapping that optimizes the metric(s) of interest and generate a configuration for 
the DNN accelerator that embodies that optimal mapping. Generally, a configuration will con- 
sist of a set of values that will be loaded into configuration registers of the DNN accelerator. 
Those configuration registers will then control the operation of the accelerator including the 
read/write access patterns to all the buffers, the NoC data transfer patterns and the sequence of 
computations at the PEs. 


6.3 MAPPER ORGANIZATION 


Given all the above inputs and outputs of the mapper, an abstract internal organization of a map- 
per can be envisioned using Figure 6.3. The flow involves creating a representation of the map 
space for the given workload and DNN accelerator, searching that space for optimal mappings 
based on the metrics of interest (e.g., performance, energy efficiency, or a combination), eval- 
uating the metrics for mappings proposed by the search, and after selecting a desired mapping 
creating a configuration for the accelerator that will result in an execution with that mapping. 


⁵ The inclusion of buffer sizes in the architecture results from DNN accelerators typically using explicit (as opposed to
implicit) data orchestration, as discussed in Section 5.8.
Figure 6.3: Block diagram of a mapper. Given a DNN problem specification/shape and the
dataflow(s)/constraints (architecture) of a DNN accelerator, the map space construction step creates
a map space. The optimizing search step iterates over mappings from the map space. It evaluates
their characteristics by sending the mapping to a performance/energy model that also takes in the
problem specification/shape, dataflow(s)/constraints (architecture), and implementation details
(micro-architecture) of the DNN accelerator and returns the performance and/or energy
characteristics of that mapping. Ultimately the optimizing search generates a “best” mapping that
the mapping to configuration step converts into a configuration.


6.3.1 MAP SPACES AND ITERATION SPACES 


In Figure 6.3, two of the inputs to the mapper (the DNN problem specification/shape and
the dataflow/constraints, i.e., the DNN accelerator architecture⁶) are inputs to the first step, which is
labeled map space construction. The responsibility of this step is to create an enumeration of all the
possible legal mappings for the given problem shape and architecture. The gray box in the figure
indicates the map space, which contains all the legal mappings.

For some DNN accelerators (e.g., the simple design in Section 6.1) it is possible to
exhaustively enumerate the mappings in the map space. However, that is not generally feasible,
so a more abstract representation of the map space can be useful. A common abstraction used
to characterize a map space employs a concept from the compiler community
called an iteration space, whose origin can be traced back to [185, 186]. For a problem
specification consisting of a single loop nest, such as that used for 1-D convolution (see Design 6.2), an
iteration space is a multi-dimensional space with one point for each execution of the body of
the loop nest. Figure 6.4 is the iteration space for this simple 1-D convolution. Each yellow oval
represents an execution of the loop body (MAC operation in line 8) at the given indices (w, s, q)
of the convolution.⁷ Each point in the iteration space is also associated with points in the data
spaces of the problem. Specifically, each point in the iteration space is connected to the data space
points that hold the operands of the computation at that point. In this example, the data spaces are
the inputs (i[]), the filter weights (f[]), and the outputs (o[]). Thus, the point (w, s, q) = (4, 1, 3)
has data space operands of i[4], f[1], and o[3].


Design 6.2 1-D Convolution

    i = Array(W)    # Input feature map
    f = Array(S)    # Filter weights
    o = Array(Q)    # Output feature map

    for q in [0, Q):
        for s in [0, S):
            w = q + s
            o[q] += i[w] * f[s]


⁶ To simplify the descriptions in the rest of the chapter, the term architecture will be used to refer to the DNN
dataflow/constraints, and the term DNN micro-architecture will be used to refer to the DNN implementation details.
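The iteration space of the 1-D convolution is small enough to enumerate directly. The sketch below builds it as an unordered set, which reflects that the space itself carries no visit order. The sizes W = 6 and S = 3 are assumptions chosen only so that the point (4, 1, 3) exists, Q = W - S + 1 assumes unit stride and no padding, and the function names are hypothetical.

```python
W, S = 6, 3
Q = W - S + 1   # assumed: unit stride, no padding


def iteration_space(Q, S):
    """One (w, s, q) point per execution of the loop body, in no particular order."""
    return {(q + s, s, q) for q in range(Q) for s in range(S)}


def operands(point):
    """Data-space coordinates touched by an iteration-space point."""
    w, s, q = point
    return {"i": w, "f": s, "o": q}


points = iteration_space(Q, S)
print(len(points))                                  # Q * S points in total
print((4, 1, 3) in points, operands((4, 1, 3)))     # operands i[4], f[1], o[3]
```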

Given an iteration space, an execution of the problem requires a visit to each point in the 
iteration space. For convolution, where the operations can be performed in any order (because 
there are no dependencies and the addition in the sum of products is commutative) one could 
visit the points in the iteration space in any order. Note, it is important to observe that the visit 
order is not determined by the specification that defined the layer’s computation, but by intrinsic 
characteristics of the desired computation. This can be confusing because the language used to 
specify a layer’s computation (e.g., using loop nests as we did here) might have the same syntax as 
the language used to describe a mapped version of the computation, and thus the specification
appears to define a visit order. Using a more general notation like the tensor index notation 
(see Section 2.3.1), which doesn’t define a visit order, can help. In other cases, the presence of 
tiling can be a hint that one is looking at a mapping, but sometimes context is the only way to 
distinguish between a problem specification and a mapped computation. 

A specific visit order through the iteration space does, however, correspond to a specific 
mapping, including its dataflow. For example, a mapping of a weight-stationary dataflow would 
be characterized by the iteration space visit order in Figure 6.5. In this case, the sequence of 
points to visit in the iteration space can be determined by following the values of the variables 
in a weight-stationary loop nest, such as Design 6.3. 
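The relationship between visit order and result can be checked with a short sketch. It assumes the same 1-D convolution with unit stride and no padding; visit_order and convolve are hypothetical helper names, and the input values are arbitrary.

```python
def visit_order(order, Q, S):
    """List (w, s, q) points in the order a two-level loop nest visits them."""
    if order == "qs":      # output-stationary: q outer, s inner
        return [(q + s, s, q) for q in range(Q) for s in range(S)]
    if order == "sq":      # weight-stationary: s outer, q inner
        return [(q + s, s, q) for s in range(S) for q in range(Q)]
    raise ValueError(order)


def convolve(i, f, order):
    """Accumulate the 1-D convolution by visiting points in the given order."""
    S = len(f)
    Q = len(i) - S + 1
    o = [0] * Q
    for w, s, q in visit_order(order, Q, S):
        o[q] += i[w] * f[s]
    return o


i, f = [1, 2, 3, 4, 5, 6], [1, 0, -1]
ws, os_ = visit_order("sq", 4, 3), visit_order("qs", 4, 3)
print(ws[:5])                                   # weight f[0] is held for the first Q visits
print(sorted(ws) == sorted(os_), ws == os_)     # same points, different order
print(convolve(i, f, "sq") == convolve(i, f, "qs"), convolve(i, f, "sq"))
```

Both orders visit exactly the same set of points and produce the same outputs; only the sequence, and therefore the data reference pattern, differs.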


⁷ To make things clearer in this example, all the indices used to access the data are included at each point in the iteration
space, even though any two could be used to compute the third. 
Figure 6.4: Iteration space for a 1-D convolution: each yellow oval represents a point in the
iteration space corresponding to a calculation at the (w, s, q) point in the space. The values of
the input and output operands from the i[], f[], and o[] data arrays are determined by those
coordinates. For example, the iteration point (w, s, q) = (4, 1, 3) accesses values at i[4], f[1],
and o[3].


As is illustrated by this example, it should be evident that many useful traversal orders can 
be specified by the loop nests described in Chapter 5. In those loop nests, the iteration variables 
are used directly (or through very simple functions) as the indices of the data tensors. There are, 
however, traversal sequences that do not precisely follow the loop nest pattern and these more 
complex patterns might be created through the use of more complex computations on the for 
loop variables to create references to the data tensors. The result of such traversals would be a
more diverse set of reference patterns to the data tensors. The benefits and hardware costs of 
such traversal orders are still largely an open question. 

Parallelism in the hardware can be represented in the iteration space as multiple
simultaneous traversal paths through the nodes in the iteration space. The parallel-for loops introduced
in Chapter 5 would naturally result in such multiple traversals.

Figure 6.5: Traversal (green arrows) of the iteration space of a 1-D convolution for a
weight-stationary dataflow.


Design 6.3 1-D Convolution with weight-stationary visit order

    i = Array(W)    # Input feature map
    f = Array(S)    # Filter weights
    o = Array(Q)    # Output feature map

    for s in [0, S):
        for q in [0, Q):
            w = q + s
            o[q] += i[w] * f[s]


One interesting phenomenon associated with multiple simultaneous traversals through
the iteration space is that the number of parallel traversals might not equal the number of
hardware units (e.g., PEs) or the lengths of the traversals might not all be equal. These phenomena
can be the consequence of hardware constraints that result in the folding and repeating mapping
patterns described in the previous chapter in Section 5.7.4. In terms of loop nests, such
situations arise when a problem shape parameter (e.g., partial sum width Q) is not factored perfectly
(i.e., imperfect factorization) into the per layer loop limits (e.g., Q2, Q1, and Q0). In any case,
imperfect factorization may result in underutilization of hardware resources, and hence lower
performance. It is the job of the mapper to decide when an imperfect factorization, despite
lower utilization, is the right choice to optimize the mapping.
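A small numerical sketch shows why an imperfect factorization can still be the better choice. It assumes a single tiled dimension Q spread over a hypothetical array of four PEs, ignores the innermost factor Q0, and treats every temporal step as equally costly; spatial_tiling is a hypothetical helper.

```python
import math


def spatial_tiling(Q, Q1, num_pes):
    """Tile Q across Q1 parallel PEs; return (Q2 temporal steps, PE utilization)."""
    if Q1 > num_pes:
        raise ValueError("Q1 cannot exceed the number of PEs")
    Q2 = math.ceil(Q / Q1)              # imperfect when Q1 does not divide Q
    return Q2, Q / (Q2 * num_pes)


Q, NUM_PES = 7, 4
# Spatial factors that divide Q exactly (perfect factorizations).
perfect = [q1 for q1 in range(1, NUM_PES + 1) if Q % q1 == 0]
print(perfect)
print(spatial_tiling(Q, 1, NUM_PES))    # perfect, but uses one PE for 7 steps
print(spatial_tiling(Q, 4, NUM_PES))    # imperfect, but finishes in 2 steps
```

Because Q = 7 is prime, the only perfect spatial factor that fits in four PEs is 1, which leaves three PEs idle throughout. The imperfect choice Q1 = 4 wastes one PE slot in the second step yet needs far fewer steps overall.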

In a DNN accelerator with a hierarchy of storage levels and interconnect networks, the 
iteration space will also have a hierarchical structure with iteration space points corresponding 
to the data movement actions and computation throughout the hierarchy. This hierarchy also 
needs to be considered in combination with the existence of simultaneous traversal paths intro- 
ducing opportunities for inter-layer and intra-layer data transfers. And if that is not complicated 
enough, there is the further opportunity for the execution of those distinct paths to be skewed in 
time. This time skew can result in changes to the data reference and communication patterns at 
all levels of the storage hierarchy. The consideration of these issues is an area of current research
and is, unfortunately, well beyond the scope of this book.

In conclusion, given all the above considerations, rules for specifying the allowable visit
orders in the iteration space, accounting for allowable dataflows and other constraints, can be
used to specify all the legal mappings (i.e., the mapping space).


6.3.2 MAPPER SEARCH 


After creating a map space, the responsibility of the mapper turns to the optimizing search step in 
Figure 6.3. The aim of the optimizing search step is to find a mapping that optimizes for some 
objective such as performance, energy, or energy/delay product. 

The search can be conducted by picking a mapping from the map space, and then
determining how well that mapping meets the optimization objectives. That determination is
typically conducted by a performance and/or energy model. In addition to a mapping, those models
require the DNN problem specification/shape, the dataflow(s)/constraints of the DNN
accelerator, and the micro-architectural implementation details in order to project the performance
and/or energy consumption for that mapping.
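The interplay of map space construction, search, and model can be sketched for the small Design 6.1 example. Everything here is a toy under stated assumptions: the map space keeps only the smallest legal M2 for each M1, the cycle model assumes one MAC per PE per cycle with no stalls (so the dataflow choice does not change its result), and the search is a plain exhaustive scan. None of the function names come from an existing mapper.

```python
import itertools
import math


def construct_map_space(M, num_pes):
    """Enumerate legal (M2, M1, PE_dataflow) mappings for the Design 6.1 example."""
    for M1, dataflow in itertools.product(range(1, num_pes + 1), ("sq", "qs")):
        M2 = math.ceil(M / M1)          # smallest M2 satisfying M <= M2 x M1
        yield {"M2": M2, "M1": M1, "PE_dataflow": dataflow}


def toy_cycle_model(mapping, S, Q):
    """Hypothetical model: one MAC per PE per cycle, no stalls."""
    return mapping["M2"] * S * Q


def search(map_space, model):
    """Exhaustive optimizing search: keep the mapping with the lowest modeled cost."""
    best, best_cost = None, math.inf
    for mapping in map_space:
        cost = model(mapping)
        if cost < best_cost:            # ties keep the first mapping encountered
            best, best_cost = mapping, cost
    return best, best_cost


M, S, Q = 5, 3, 8
best, cost = search(construct_map_space(M, num_pes=2),
                    lambda mp: toy_cycle_model(mp, S, Q))
print(best, cost)
```

A realistic mapper would replace the toy model with a performance and/or energy model that also takes the architecture and micro-architecture as inputs, and would usually replace the exhaustive scan with a search strategy suited to a much larger map space.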


6.3.3 MAPPER MODELS AND CONFIGURATION GENERATION 


Performance modeling approaches for DNN accelerators can range from detailed cycle-level 
simulation to simple analytic models. In addition, in a recursive application of DNN processing, 
the model can be implemented as a neural network, which predicts performance. 

The energy cost for a DNN accelerator can be estimated using an architecture-level energy 
estimation methodology such as Accelergy [187] or with an analytic model for energy projec- 
tions such as is described next in Section 6.4. 

Interpreting the results of these models and deciding on the search pattern are specific to the
search algorithm, and beyond the scope of this discussion. But after finding a “best” mapping, that mapping is
converted into a configuration for the DNN accelerator by the mapping-to-configuration step of
the mapper.
6.4 ANALYSIS FRAMEWORK FOR ENERGY EFFICIENCY


Unlike conventional compilers, which typically only focus on optimizing performance, a map- 
per often needs to optimize for energy efficiency. Conducting such an optimization typically 
requires a model capable of projecting the energy consumption of a particular mapping on a 
DNN accelerator. In this section, we will introduce a framework for the evaluation of energy 
consumption of DNN accelerators based on a spatial architecture. The analysis methodology is 
lightweight yet general, such that it can be applied to the analysis of many DNN accelerator 
architectures. 

The way each MAC operation fetches inputs (filter weights and input activations) and 
accumulates partial sums introduces different energy costs due to two factors: 


• how the dataflow exploits input data reuse and partial sum accumulation scheduling; and


• the different energy costs of fetching data from different storage elements in the DNN
accelerator.


The goal of an energy-efficient dataflow is then to perform most data accesses using the data
movement paths with lower energy cost. This is an optimization process that takes all data ac- 
cesses into account, and will be affected by the layer shape and available hardware resources. 

In this section, we will describe a framework that can be used by the search step in a map- 
per to optimize the dataflows for spatial architectures in terms of energy efficiency. Specifically, 
it defines the energy cost for each level of the storage hierarchy in the DNN accelerator. Then, 
it provides a simple methodology to incorporate any given dataflow into an analysis using this 
hierarchy to quantify the overall data movement energy cost. This allows for a search for the 
optimal mapping for a dataflow that results in the highest energy efficiency for a given DNN 
layer shape. 


Data Movement Hierarchy: We assume a spatial architecture that provides four levels of storage
hierarchy. Sorted by energy cost per data access from high to low, the levels are DRAM,
global buffer, NoC (the level labeled "array" in the equations of this section), and RF. Fetching
data from a higher-cost level to the ALU incurs higher energy consumption. Also, the energy cost
of moving data between any two levels is dominated by the level with the higher cost. Similar to
the energy consumption quantification in previous experiments [121, 188, 189], Figure 6.6 shows
the normalized energy consumption of accessing data from each storage level relative to the
computation of a MAC at the ALU. The numbers are extracted from a commercial 65 nm process.


Analysis Methodology: Given a dataflow, the analysis is formulated in two parts: (1) the input 
data access energy cost, including filter weights and input activations; and (2) the partial sum 
accumulation energy cost. The energy costs are quantified through counting the number of ac- 
cesses at each level of the previously defined hierarchy, and weighting the accesses at each level 
with a cost from Figure 6.6. The overall data movement energy of a dataflow is obtained through 
combining the results from the two types of input data and the partial sums.

Figure 6.6: Normalized energy cost relative to the computation of one MAC operation at ALU.
Numbers are extracted from a commercial 65 nm process.


6.4.1 INPUT DATA ACCESS ENERGY COST 


If an input data value is reused for many operations, ideally the value is moved from DRAM 
to RF once, and the ALU reads it from the RF many times. However, due to limited storage 
and operation scheduling, the data is often kicked out of the RF before exhausting reuse. The 
ALU then needs to fetch the same data again from a higher-cost level to the RF. Following this 
pattern, data reuse can be split across the four levels. Reuse at each level is defined as the number of 
times each data value is read from this level to its lower-cost levels during its lifetime. If the total
number of reuses for a data value is a × b × c × d, it can be split into reuses at DRAM, global
buffer, array, and RF of a, b, c, and d times, respectively. An example is shown in Figure 6.7,
in which the total number of reuses, 24, is split into a = 1, b = 2, c = 3, and d = 4. The
energy cost estimation for this reuse pattern is:


$$
a \times EC(\text{DRAM}) + ab \times EC(\text{global buffer}) + abc \times EC(\text{array}) + abcd \times EC(\text{RF}), \tag{6.1}
$$

where EC(·) is the energy cost from Figure 6.6.⁸
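
As a minimal sketch, Equation (6.1) can be evaluated directly once the per-level costs are known. The cost values in `ASSUMED_EC` are illustrative placeholders standing in for Figure 6.6, which is not reproduced in the text, and the function name is hypothetical.

```python
# Normalized energy per access, relative to one MAC at the ALU.
# ASSUMPTION: illustrative placeholder values standing in for Figure 6.6,
# which is not reproduced in the text; replace them with the figure's numbers.
ASSUMED_EC = {"DRAM": 200.0, "global buffer": 6.0, "array": 2.0, "RF": 1.0}


def input_access_energy(a, b, c, d, ec=ASSUMED_EC):
    """Equation (6.1): data movement energy for one input data value.

    a, b, c, d are the reuse counts at DRAM, global buffer, array, and RF.
    """
    return (a * ec["DRAM"]
            + a * b * ec["global buffer"]
            + a * b * c * ec["array"]
            + a * b * c * d * ec["RF"])


# Reuse split from the Figure 6.7 example: 24 reuses as a=1, b=2, c=3, d=4.
energy = input_access_energy(1, 2, 3, 4)
per_use = energy / (1 * 2 * 3 * 4)  # average energy per use of the value
print(f"total = {energy}, per use = {per_use:.2f}")
```

With the assumed costs, the four terms are 200 + 12 + 12 + 24 = 248. Under these costs, the single DRAM access dominates even though the RF is read 24 times.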


6.4.2 PARTIAL SUM ACCUMULATION ENERGY COST 


Partial sums travel between ALUs for accumulation through the four-level hierarchy. In the 
ideal case, each generated partial sum is stored in a local RF for further accumulation. However, 
this is often not achievable due to the overall operation scheduling, in which case the partial 
sums have to be stored to a higher-cost level and read back again afterward. Therefore, the total 
number of accumulations, a x b x c x d, can also be split across the four levels. The number of 
accumulations at each level is defined as the number of times each data goes in and out of its lower- 
cost levels during its lifetime. An example is shown in Figure 6.8, in which case the total number 
of accumulations, 36, is split into a = 2, b = 3, c = 3, and d = 2. The energy cost can then be
estimated as

$$
(2a - 1) \times EC(\text{DRAM}) + 2a(b - 1) \times EC(\text{global buffer}) + ab(c - 1) \times EC(\text{array}) + 2abc(d - 1) \times EC(\text{RF}). \tag{6.2}
$$

The factor of two accounts for both reads and writes. Note that in this calculation the
accumulation of the bias term is ignored, as it has negligible impact on the overall energy.


⁸ Optimization can be applied to Equation (6.1) when there is no reuse opportunity. For instance, if d = 1, the data is
transferred directly from a higher level to the ALU and bypasses the RF, and the last term in Equation (6.1) can be dropped.

Figure 6.7: An example of the input activation or filter weight being reused across four levels of
the memory hierarchy.
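
Equation (6.2) can be sketched in the same way. As before, the costs in `ASSUMED_EC` are illustrative placeholders standing in for Figure 6.6, and the function name is hypothetical.

```python
# Normalized energy per access, relative to one MAC at the ALU.
# ASSUMPTION: illustrative placeholder values standing in for Figure 6.6.
ASSUMED_EC = {"DRAM": 200.0, "global buffer": 6.0, "array": 2.0, "RF": 1.0}


def psum_accumulation_energy(a, b, c, d, ec=ASSUMED_EC):
    """Equation (6.2): data movement energy for accumulating one output value.

    a, b, c, d are the accumulation counts at DRAM, global buffer, array, and RF.
    The factors of two count one read plus one write per accumulation.
    """
    return ((2 * a - 1) * ec["DRAM"]
            + 2 * a * (b - 1) * ec["global buffer"]
            + a * b * (c - 1) * ec["array"]
            + 2 * a * b * c * (d - 1) * ec["RF"])


# Accumulation split from the Figure 6.8 example: 36 accumulations as a=2, b=3, c=3, d=2.
energy = psum_accumulation_energy(2, 3, 3, 2)
print(f"total = {energy}")
```

With the assumed costs, the four terms are 600 + 48 + 24 + 36 = 708.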


6.4.3 OBTAINING THE REUSE PARAMETERS 


For each mapping, there exists a set of reuse parameters (a, b, c, d) for each of the three data 
types, i.e., input activations, filter weights, and partial sums. These parameters are a function of 
the variables in the loop limits of the dataflow. For example, the set of reuse parameters for the
simple 1-D convolution dataflow shown in Figure 6.9 is a function of Q2, Q1, Q0 and S2, S1,
S0, and is summarized in Table 6.1 (ignoring halos on the edges of the convolution).

Figure 6.8: An example of the partial sum accumulation going through four levels of the memory
hierarchy.


6.5 EYEXAM: FRAMEWORK FOR EVALUATING PERFORMANCE


Thus far in this chapter, we have studied mapping as a relatively heavy-weight process to find
a configuration for a particular workload on a completely specified DNN accelerator design. 
Sometimes, it is important to be able to understand mapping and performance projections earlier 
in the design process or simply to gain insights into what is limiting performance. The Eyexam 
framework is a step-by-step process for doing just that. 

Eyexam provides a systematic way of understanding the performance limits for DNN
processors as a function of specific characteristics of the workload (i.e., DNN model) and
accelerator design (i.e., architecture and micro-architecture); it applies these characteristics as
sequential steps to increasingly tighten the bound on the performance limits. Specifically, instead
of comparing the overall performance of different designs, which can be affected by many
non-architectural factors such as system setup and technology differences, Eyexam provides a
step-by-step process that associates a certain amount of performance loss with each architectural
design decision (e.g., dataflow, number of PEs, NoC, etc.) as well as with the properties of the
workload, which for DNNs are dictated by the layer shape and size (e.g., filter shape, feature map
size, batch size, etc.).

Figure 6.9: An example dataflow for a 1-D convolution.


Table 6.1: Reuse parameters for the 1-D convolution dataflow in Figure 6.9

Data type           Reuse parameters
Input activation    S/(S0 × S1 × S2)
Weight              Q/(Q0 × Q1 × Q2)
Partial sums        S/(S0 × S1 × S2)

Eyexam focuses on two main factors that affect performance: (1) the number of active PEs 
due to the mapping as constrained by the dataflow; and (2) the utilization of active PEs, i.e., 
percentage of active cycles for the PE, based on whether the NoC has sufficient bandwidth to 
deliver data to PEs to keep them active. The product of these two components can be used to 
compute the utilization of the PE array as follows: 


utilization of the PE array = number of active PEs x utilization of active PEs. (6.3) 


Later in this section, we will see how this approach can use an adapted form of the well-known 
roofline model [119] for the analysis of DNN processors. 

We will perform this analysis on a generic DNN processor architecture based on a spatial
architecture that consists of a global buffer and an array of PEs. Each PE can have its own register
file (RF) and control logic, and the PE array communicates with the global buffer through the
NoCs. Separate NoCs are used for the three data types.

As described in Chapter 5, the dataflow of a DNN processor is one of the key attributes 
that define its architecture [184]. We will feature architectures that support the following four 
popular dataflows [142, 159]: weight stationary (WS), output stationary (OS), input stationary 
(IS), and row stationary (RS). 

To help illustrate the capabilities of Eyexam, we will re-examine the simple 1-D convolu- 
tion example in Section 6.5.1, and walk through the key steps of Eyexam in Section 6.5.2 with 
the 1-D convolution. We will then highlight various insights that Eyexam gives on real DNN 
workloads and architectures in Section 6.5. 


6.5.1 SIMPLE 1-D CONVOLUTION EXAMPLE 


We will re-examine the simple 1-D convolution example. This example illustrates the two
components of the problem. The first is the workload, which is represented by the shape of the layer
for a 1-D convolution. This comprises the filter size S, the input feature map size W, and
the output feature map size Q. The second is the architecture of the processing unit, for which a
key characteristic is the dataflow shown in Figure 6.9. In this example, the two parallel-for loops
represent the distribution of computation across multiple PEs (i.e., spatial processing); the inner
two for loops represent the temporal processing and RF accesses within a PE; and the outer
two for loops represent the temporal processing of multiple passes across the PE array and global
buffer (GLB) accesses. For this example, we assume the input activations and weights fit in the
GLB, i.e., the reuse parameter a is 1 for both the input activation and weight.
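
The loop structure described above can be sketched as a six-level loop nest. This is an illustration rather than a reproduction of Figure 6.9: the ordering of the q and s loops within each pair, the index arithmetic, the unit stride with W = Q + S − 1, and the function name are all assumptions, and the parallel-for loops are executed serially here.

```python
def conv1d_loop_nest(inputs, weights, Q0, Q1, Q2, S0, S1, S2):
    """1-D convolution written as a tiled loop nest (assumed stride 1, no padding)."""
    Q, S = Q0 * Q1 * Q2, S0 * S1 * S2
    assert len(weights) == S and len(inputs) == Q + S - 1  # assumed W = Q + S - 1
    outputs = [0] * Q
    for q2 in range(Q2):                  # outer loops: passes over the PE array (GLB)
        for s2 in range(S2):
            for q1 in range(Q1):          # parallel-for: spatial, across PEs
                for s1 in range(S1):      # parallel-for: spatial, across PEs
                    for q0 in range(Q0):  # inner loops: temporal, within a PE (RF)
                        for s0 in range(S0):
                            q = (q2 * Q1 + q1) * Q0 + q0   # flattened output index
                            s = (s2 * S1 + s1) * S0 + s0   # flattened weight index
                            outputs[q] += inputs[q + s] * weights[s]
    return outputs


# Hypothetical workload: S = 3 weights, Q = 6 outputs, W = 8 inputs.
inputs = [1, 2, 3, 4, 5, 6, 7, 8]
weights = [1, 2, 3]
print(conv1d_loop_nest(inputs, weights, Q0=1, Q1=3, Q2=2, S0=1, S1=3, S2=1))
```

Any factorization of Q and S produces the same outputs; only the order in which the MACs are performed, and therefore the reuse at each storage level, changes.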

A mapping assigns specific values to the loop limits Q0, Q1, Q2 and S0, S1, S2 to execute
a specific workload shape and loop ordering. This assignment of Q0, Q1, Q2 and S0, S1, S2 is
constrained by the shape of the workload and the hardware resources. The workload constraints
in this example are Q0 × Q1 × Q2 = Q and S0 × S1 × S2 = S.⁹ The architectural constraint
in this example is that Q1 × S1 must be less than or equal to the number of PEs (later we will see
that the NoC can impose additional restrictions). The size of the RF allocated to input activations,
partial sums, and weights will restrict Q0 and S0, and the space in the GLB allocated to partial
sums restricts Q1 and S1.
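
As a rough sketch, the map space of this example can be enumerated by listing every perfect factorization of Q and S and keeping those that fit in the PE array. The RF and GLB capacity limits are left out for brevity, and the helper names and the workload sizes are hypothetical.

```python
from itertools import product


def factor_triples(n):
    """All ordered triples (x0, x1, x2) of positive integers with x0 * x1 * x2 == n."""
    return [(x0, x1, n // (x0 * x1))
            for x0 in range(1, n + 1) if n % x0 == 0
            for x1 in range(1, n // x0 + 1) if (n // x0) % x1 == 0]


def legal_mappings(Q, S, num_pes):
    """Mappings that satisfy the workload and PE-count constraints."""
    return [(q, s)
            for q, s in product(factor_triples(Q), factor_triples(S))
            if q[1] * s[1] <= num_pes]   # Q1 * S1 must fit in the PE array


# Hypothetical workload: Q = 4 outputs, S = 3 weights, 4 PEs.
mappings = legal_mappings(Q=4, S=3, num_pes=4)
print(len(mappings), "legal mappings")
for (Q0, Q1, Q2), (S0, S1, S2) in mappings[:3]:
    print(f"Q0={Q0} Q1={Q1} Q2={Q2}  S0={S0} S1={S1} S2={S2}  active PEs={Q1 * S1}")
```

Even this small example has 15 legal mappings out of 18 factorizations, which hints at how quickly the map space grows for realistic layer shapes.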

While this is a simple 1-D example, it can be extended to additional levels of buffering
by adding additional levels of loop nest (Section 5.6). Furthermore, extending it to support
additional dimensionality (e.g., 2-D and channels) will also result in additional loops.


⁹ We assume perfect factorization in this example. Imperfect factorization will lead to cycles where no work is done, as
discussed in Section 5.5.

6.5.2 APPLY PERFORMANCE ANALYSIS FRAMEWORK TO 1-D EXAMPLE


The goal of Eyexam is to provide a fine-grain performance profile for an architecture. It is a
sequential analysis process that involves seven major steps. The process starts with the assumption
that the architecture has infinite processing parallelism, storage capacity, and data bandwidth.
Therefore, it has infinite performance (as measured in MACs/cycle).

For each of the following steps, certain constraints will be added to reflect changes in the 
assumptions on the architecture or workload. The associated performance loss can therefore be 
attributed to that change, and the final performance at one step becomes the upper-bound for 
the next step. 

Step 1 (Layer Shape and Size): In this first step, we look at the impact of the workload
constraint, specifically the layer shape (S, W, and Q), assuming unbounded values for S1 and
Q1 since there are no architectural constraints. This allows us to set S1 = S, Q1 = Q, and Q2 =
Q0 = 1, S2 = S0 = 1, so that all processing is spatial (i.e., parallel) and none is temporal (i.e.,
serial). Therefore, the performance upper bound is determined by the finite size of
the workload (i.e., the number of MACs in the layer, which is Q × S).

Step 2 (Dataflow): In this step, we define the dataflow and examine the impact of this
architectural constraint. For example, to configure the example loop nest into a weight-stationary
dataflow, we would set Q1 = 1, Q0 = Q and S1 = S, S0 = 1. This means that each PE stores
one weight, that weight is reused Q0 times within that PE, and the number of PEs equals the
number of weights. This forces the absolute maximum amount of reuse for weights at the PE.
The forced serialization of Q0 = Q reduces the performance upper bound from Q × S to S,
which is the maximum parallelism of the dataflow.

Step 3 (Number of PEs): In this step, we define a finite number of PEs, and look at the 
impact of this architectural constraint. For example, in the 1-D WS example, where Q1 = 1 
and Q0 = Q, S1 is constrained to be less than or equal to the number of PEs, which dictates 
the theoretical peak performance. As hinted in Section 5.5, there are two scenarios when the 
actual performance is less than the peak performance. The first scenario is called spatial mapping 
fragmentation, in which case S, and therefore S1, is smaller than the number of PEs. In this case, 
some PEs are completely idle throughout the entire period of processing. The second scenario is 
called temporal mapping fragmentation, in which case S is larger than the number of PEs but 
not an integer multiple of it. For example, when the number of PEs is 4, S = 7, and S1 = 4, it 
takes two cycles to complete the processing, and none of the PEs are completely idle. However, 
one of the 4 PEs will only be 50% active. Therefore, it still does not achieve the theoretical peak 
performance. In general, however, if the workload does not map into all of the PEs in all cycles, 
then some PEs will not be used at 100%, which should be taken into account in performance 
evaluation. 
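
Both fragmentation scenarios can be captured by one small calculation. The sketch below assumes the 1-D WS mapping of this step (one weight per PE per pass, with every pass taking the same time); the function name is hypothetical.

```python
import math


def ws_pe_utilization(S, num_pes):
    """Average fraction of busy PEs when S weights are mapped onto num_pes PEs."""
    passes = math.ceil(S / num_pes)      # temporal passes needed to cover all weights
    return S / (passes * num_pes)        # useful PE-passes over available PE-passes


# Temporal mapping fragmentation: the example from the text (4 PEs, S = 7).
print(ws_pe_utilization(S=7, num_pes=4))
# Spatial mapping fragmentation: S is smaller than the number of PEs.
print(ws_pe_utilization(S=3, num_pes=4))
```

For the example in the text, three PEs are busy in both passes and one PE is busy in only one of them, giving an average utilization of 7/8.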

Step 4 (Physical dimensions of the PE array): In this step, we consider the physical
dimensions of the PE array (e.g., arranging 12 PEs as 3×4, 2×6, or 4×3, etc.). The spatial partitioning
is constrained per dimension, which can cause additional performance loss. To explain this step
with the simple example, we need to relax the WS restriction. Let us assume Q1 is mapped to
the width of the 2-D array and S1 is mapped to the height of the 2-D array. If Q1 is less than
the width of the array or S1 is less than the height of the array (spatial mapping fragmentation),
not all PEs will be utilized, even without the constraint that Q1 × S1 be smaller than or equal to
the number of PEs. A similar case can be constructed for temporal mapping fragmentation as
well. This architectural constraint further reduces the number of active PEs.
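
The effect of the array dimensions can be illustrated with a small calculation. The sketch below assumes Q is tiled across the array width and S across its height with Q0 = S0 = 1; the workload sizes and the function name are hypothetical.

```python
import math


def array_utilization(Q, S, width, height):
    """Average fraction of busy PEs when Q is tiled across the array width
    and S across its height, one MAC per PE per pass (Q0 = S0 = 1)."""
    passes = math.ceil(Q / width) * math.ceil(S / height)
    return (Q * S) / (passes * width * height)


# Hypothetical workload (Q = 4, S = 3) on three 12-PE arrays, given as height x width.
for height, width in [(3, 4), (2, 6), (4, 3)]:
    u = array_utilization(Q=4, S=3, width=width, height=height)
    print(f"{height}x{width}: utilization = {u:.2f}")
```

All three arrays have 12 PEs and the workload has 12 MACs, yet only the array whose dimensions match the workload keeps every PE busy; the other two need two passes and are half idle on average.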

Step 5 (Storage Capacity): In this step, we consider the impact of making the buffer storage 
finite. For example, for the WS dataflow example, if the allocated storage for partial sums in the 
GLB is limited, it limits the number of weights that can be processed in parallel, which limits 
the number of PEs that can operate in parallel. Thus, an architectural constraint on how many 
partial sums can be stored in the GLB restricts Q1 and S1, which again can reduce the number 
of active PEs. 

Step 6 (Data Bandwidth): In this step, we consider the impact of a finite bandwidth for 
delivering data across the different levels of the loop nest (i.e., memory hierarchy). The amount 
of data that needs to be transferred between each level of the loop nest and the bandwidth at 
which we can transmit the data dictate the speed at which the index of the loop can increment 
(i.e., number of cycles per MAC). For instance, the bandwidth of the RF in the PE dictates the 
increment speed of s0 and q0, the bandwidth of the NoC and GLB dictates the rate of change 
of s1 and q1, and the off-chip bandwidth dictates the rate of change of s2 and q2. 

To quantify the impact on performance from insufficient bandwidth, we can adapt the
well-known roofline model [119] for the analysis of DNN processors. The roofline model, as
shown in Figure 6.10, is a tool that visualizes the performance of an architecture under various
degrees of operational intensity. It assumes a processing core, e.g., PE array, that has
insufficient local memory to fit the entire workload, and therefore its performance can be limited by
insufficient bandwidth between the core and the memory, e.g., GLB. When the operational
intensity is lower than that at the inflection point, the performance will be bandwidth-limited;
otherwise, it is computation-limited. The roofline indicates the performance upper-bound, and
the performance of actual workloads sits in the area under the roofline.

For this analysis, we adapt the roofline model as follows: 


• We use three separate rooflines for the three data types instead of one with the aggregated
bandwidth and operational intensity.¹⁰ This helps to identify the performance bottleneck
and is also a necessary setup since independent NoCs are used for each data type. However,
the performance upper-bound will be the worst case of the three rooflines.


• The roofline is typically drawn with the peak performance of the core and the total
bandwidth between the core and memory. However, since we have gone through the first 5 steps
in Eyexam, it is possible to get a tighter bound (Figure 6.11). The leveled part of the roofline
is now at the performance bound from step 5; the slanted part of the roofline should only
consider the bandwidth to the active PEs for each data type. Since performance is
measured in MACs/cycle, the bandwidth should factor in the clock rate differences between
processing and data delivery.


• For a workload layer, the operational intensity of a data type is the same as its amount
of data reuse in the PE array, including both temporal reuse with the RF and the spatial
reuse across PEs. It is measured in MACs per data value (MAC/data) to normalize the
differences in bitwidth.


¹⁰ Ideally, we should draw a roof-manifold with the operational intensity of each data type on a separate axis; unfortunately,
it will be a 4-D plot that cannot be visualized.

Figure 6.10: The roofline model.
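
The adapted roofline can be sketched as a minimum over three per-data-type rooflines. The peak value, bandwidths, and operational intensities below are hypothetical numbers chosen for illustration, and the function name is an assumption.

```python
def roofline_bound(peak_macs_per_cycle, bandwidth, intensity):
    """Performance bound (MACs/cycle) from one roofline per data type.

    peak_macs_per_cycle: leveled part of the roofline (the bound after Step 5).
    bandwidth: data values per cycle delivered to the active PEs, per data type.
    intensity: operational intensity in MACs per data value, per data type.
    Returns the bound and the data type whose roofline is the lowest.
    """
    per_type = {t: min(peak_macs_per_cycle, bandwidth[t] * intensity[t])
                for t in bandwidth}
    limiter = min(per_type, key=per_type.get)
    return per_type[limiter], limiter


# Hypothetical design point: 12 active PEs, i.e., at most 12 MACs/cycle.
bound, limiter = roofline_bound(
    peak_macs_per_cycle=12.0,
    bandwidth={"weight": 1.0, "input activation": 2.0, "partial sum": 4.0},
    intensity={"weight": 16.0, "input activation": 4.0, "partial sum": 4.0},
)
print(f"bound = {bound} MACs/cycle, limited by {limiter}")
```

In this hypothetical case the weight and partial sum rooflines reach the compute bound, but the input activations can sustain only 2 × 4 = 8 MACs/cycle, so they set the overall bound.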


Step 7 (Varying Data Access Patterns): In this step, we consider the impact of bandwidth 
varying across time due to the dynamically changing data access patterns (Step 6 only addresses 
average bandwidth). For the WS example, during ramp up, the weight NoC will require high 
bandwidth to load the weights into the RF of the PEs, but in steady state, the bandwidth re- 
quirements of the weight NoC will be low since the weights are reused within the PE. The 
performance upper bound will be affected by the ratio of time spent in ramp up versus steady state,
and by the ratio of the bandwidth demand versus available bandwidth. This step causes the
performance point to fall off the roofline, as shown in Figure 6.11. There exist many common solutions
to address this issue, including using double buffering or increased bus width for the NoC.

Table 6.2 summarizes the constraints applied at each step. While Eyexam is useful for
examining the impact of each step on performance, it can also be used in the architecture design
process to iterate through a design. For instance, if one selects a dataflow in Step 2 and discovers
that the storage capacity in Step 5 is not a good match, causing a large performance loss, one could
return to Step 2 to make a different dataflow design choice and then go through the steps again.
Another example is that double buffering could be used in Step 7 to hide the high bandwidth
during ramp up; however, this would require returning to Step 5 to change the effective storage
capacity constraints. Eyexam can also be applied to consider the trade-off between performance
and energy efficiency in combination with the framework for evaluating energy efficiency as
discussed in Section 6.4, as well as to consider the impact of sparsity and workload imbalance on
performance.

Figure 6.11: Impact of steps on the roofline model. [Plot of performance (MAC/cycle), with a
"Peak Perf." mark on the axis, versus workload operational intensity (MAC/data); the slope is the
BW to only active PEs. Marked levels: Step 1: maximum workload parallelism; Step 2: maximum
dataflow parallelism; Number of PEs; Step 3: # of active PEs under a finite PE array size;
Step 4: # of active PEs under fixed PE array dimensions; Step 5: # of active PEs under fixed
storage capacity; Step 6: lower active PE utilization due to insufficient avg. BW; Step 7: lower
active PE utilization due to insufficient inst. BW.]


6.6 TOOLS FOR MAP SPACE EXPLORATION 


There are a number of efforts that have developed tools to address the DNN accelerator mapping
problem. Some well-known mappers are the Eyeriss mapper [184], MAESTRO [190],
R-Stream [191], Timeloop [192], and TVM [130]. Other notable tools that address a similar
problem, but were not originally intended to target DNN accelerators, are Halide [193] and
Tiramisu [194]. Halide introduced the separation of algorithm from schedule (or mapping),
and Tiramisu, although targeted only at CPUs and GPUs, is particularly interesting because it
optimizes sparse computations, which are discussed in Chapter 8.

Mappers and mapper-like tools can typically be characterized by how they approach the
following issues: (1) The range of architectures they target. (2) The range of computations they
can map (e.g., just DNN layers or more general computations). This is generally dictated by
the framework they use to describe the map space (e.g., to distinguish legal from illegal mappings).
These frameworks also usually facilitate the analysis of the behavior of the computation on the
specified architecture. (3) The heuristic they use to search the map space. (4) The technique they
use to predict the performance and/or energy for a given mapping. Below, we will characterize
the approaches used by some popular tools. Note, however, that many of these tools are still
evolving rapidly, and therefore this characterization is just a snapshot as of the time this is being
written.

There is a considerable range of architectures that existing mappers target. Some target
only a specific DNN accelerator. An example is the Eyeriss mapper, which only targets
Eyeriss' row-stationary dataflow. In order to support a wider range of target designs, others accept a
template that describes a DNN accelerator architecture.¹¹ These templates include factors such
as the dataflows supported, the number of PEs and levels of storage hierarchy, the storage
sizes, and the network connectivity. The MAESTRO, Timeloop, and TVM mappers all accept such
templates. Some tools (e.g., TVM) extend to target general-purpose processors by characterizing
certain hardware functionality (e.g., vectors or Nvidia's tensor cores) as highly constrained
templates.

Table 6.2: Summary of steps in Eyexam


The range of problem specifications accepted by these tools also varies. Some, like the
Eyeriss mapper, assume the standard expression for CONV/FC layers (see Chapter 2). Others
accept expressions equivalent to the tensor index expressions described in Chapter 2. These
include Tiramisu and Timeloop. TVM accepts problem specifications generated by a variety of
standard machine learning libraries, such as TensorFlow [98] and PyTorch [99]. Finally, Halide
operates as an embedded domain-specific language (EDSL) in C++.


¹¹ Architecture templates will typically be expressed in a human-readable configuration language, such as YAML.

Another requirement of a mapper is having a way to represent the map space. Some use 
a well-known compiler framework called the polyhedral model [128]. This model allows one to 
represent sequences of loop nests that form what are called static control parts or SCoPs. Roughly, 
a SCoP is a sequence of loop nests where the loops can have dynamic loop bounds and where 
the statements in the body of the loops can have conditions that control whether the statement 
executes or not. However, the loop bounds and conditions must be linear functions of the loop 
index variables and constant parameters (i.e., they are not data dependent). Polyhedral models 
are often paired with an analysis framework that computes behavioral characteristics like reuse 
for specific architectures. Tools that use a polyhedral model include R-Stream and Tiramisu. 

So far as we know, there is no polyhedral-based cost model for some of the more complex
characteristics of DNN accelerators. These more complex characteristics include multi-level
explicit decoupled data orchestration (EDDO)-style buffering (see Section 5.8) or complex NoC
topologies, such as those with support for multi-level multicast or spatial reduction.
Furthermore, existing frameworks that implement the polyhedral model are quite heavyweight and can
be overkill for some situations. Therefore, some mappers use custom representations for the map
space. These often support a more restricted set of loop nests (e.g., no conditionals, or only fixed
loop bounds). Examples of such mappers are the Eyeriss mapper, Halide, Timeloop, and TVM.
These systems tend to determine data reuse using a custom analysis that tracks the movement
of data tiles, which is described in Section 6.4.

Search heuristics vary greatly. Some current approaches include: genetic algorithm-based 
search (Eyeriss mapper), beam search (Halide, Tiramisu), parallel random search of legal map- 
pings (Timeloop), and parallel simulated annealing (TVM). 
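
To make the idea of searching a map space concrete, the sketch below runs a serial random search over legal mappings of the 1-D convolution example from Section 6.5.1. It is a toy illustration and not the algorithm of Timeloop or any other tool named above; the cost function (number of idle PEs), the workload sizes, and all names are assumptions.

```python
import random


def random_factor_triple(n, rng):
    """Sample (x0, x1, x2) with x0 * x1 * x2 == n by picking divisors at random."""
    x0 = rng.choice([d for d in range(1, n + 1) if n % d == 0])
    rest = n // x0
    x1 = rng.choice([d for d in range(1, rest + 1) if rest % d == 0])
    return x0, x1, rest // x1


def random_search(Q, S, num_pes, cost, trials=1000, seed=0):
    """Keep the lowest-cost legal mapping seen over a fixed number of samples."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        q = random_factor_triple(Q, rng)   # (Q0, Q1, Q2)
        s = random_factor_triple(S, rng)   # (S0, S1, S2)
        if q[1] * s[1] > num_pes:          # illegal: does not fit in the PE array
            continue
        c = cost(q, s)
        if c < best_cost:
            best, best_cost = (q, s), c
    return best, best_cost


NUM_PES = 12  # hypothetical PE count


def idle_pes(q, s):
    """Toy cost: PEs left unused by the spatial factors Q1 and S1."""
    return NUM_PES - q[1] * s[1]


best, best_cost = random_search(Q=16, S=3, num_pes=NUM_PES, cost=idle_pes)
print(best, best_cost)
```

A real mapper would replace `idle_pes` with a projection of energy and/or performance, such as the analysis of Section 6.4.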

There are two major approaches to projecting the performance and/or energy consumption
for a mapping: deep learning-based and analytical.¹² To use a deep learning-based approach, a
DNN model is trained to project operational metrics for a mapping on a given architecture.
Halide, Tiramisu, and TVM use this approach. Such models are relatively fast and accurate.
However, deep learning-based models tend to only be useful for the architectures they were
trained on and are not amenable to design space exploration because the training does not reflect
the characteristics of designs that were not in the training set (i.e., those one would like to
discover). Also, most works focus on projecting performance rather than energy.

To allow more flexibility in the range of architectures supported and to project energy
(as well as performance), some tools use custom analytical models. Such tools include the
Eyeriss mapper (which uses the model described in Section 6.4), MAESTRO, and Timeloop.
Timeloop is paired with a discrete energy evaluation framework, Accelergy [187], which
allows for template-based descriptions of the architecture and component energy costs, as shown
in Figure 6.12. This latter capability is especially useful for understanding the impact of new
technologies like those discussed in Chapter 10.


¹² A cycle-level simulation approach would almost certainly be too slow for projecting metrics for a mapping during a
search.

Figure 6.12: Timeloop [192] with integration of Accelergy [187] as energy estimation model.
Timeloop sends projected action counts for a mapping to Accelergy and receives an energy
estimation to guide its search. Accelergy plug-ins allow for customization of component energy
estimation. These tools are available at http://accelergy.mit.edu/tutorial.html.

In summary, there are a variety of mapper (or mapper-like) tools that target a variety of
DNN accelerator architectures to find an optimal mapping. They typically use some variant of a
loop nest (or sequence of loop nests) to describe the mappings in the map space and use analysis
tools to project performance and/or energy to guide a search of the map space to find an optimal
mapping.


PART III 


Co-Design of DNN Hardware and Algorithms


CHAPTER 7


Reducing Precision 


As highlighted in the previous chapters, data movement dominates energy consumption and
can affect the throughput for memory-bound systems. One way to address this issue is to reduce
the number of bits (bit width) required to represent the weights and activations of the DNN
model. Using fewer bits per weight and/or activation effectively reduces the number of unique
values that they can take on and thus is often referred to as reducing precision. Benefits of reduced
precision can include reduced data movement (i.e., reduced memory bandwidth), reduced
storage cost (i.e., reduced chip area), reduced energy per memory access (due to smaller memories),
and reduced energy and time per MAC operation. However, reducing precision can also affect
accuracy, and thus its impact on accuracy must be carefully evaluated.

Therefore, the overarching goal in reducing precision is to minimize the number of bits while
also maintaining accuracy and minimizing any hardware overhead. In this chapter, we will dis- 
cuss various forms of quantization, which involves mapping a larger set of values to a smaller 
set of values. Thus, quantization can be applied to reduce precision. We will discuss various 
design considerations when deciding the number of unique values (and number of bits) to al- 
low per weight and/or activation (which affects the accuracy), the relationship between these 
values (which affects how they are computed upon and stored—i.e., hardware overhead), and 
whether these properties (i.e., number of values and relationship between them) are allowed 
to vary across different parts of the DNN model (e.g., layers and/or filters), which can further 
reduce the number of bits at the cost of hardware overhead to support this flexibility. 


7.1 BENEFITS OF REDUCED PRECISION


Reducing the number of bits per weight and/or activation (and consequently partial sum) has 
several benefits. 

First, it reduces the amount of data movement, resulting in lower energy consumption.
Reducing data movement can also increase throughput since it reduces memory bandwidth re- 
quirements, which reduces the likelihood that processing elements (PEs) become idle while 
waiting for data. 

Second, it reduces the amount of storage required for a given number of weights, activa- 
tions, and/or partial sums. This can be exploited in two ways. 


• It can reduce the amount of on-chip memory, which reduces the area cost of the chip;
alternatively, for a fixed chip area, more area can be allocated to the PEs to allow for more
parallel compute, which can help increase throughput. In addition, smaller memories tend
to consume less energy.


• It can keep the same amount of on-chip memory, but now each memory can store more
weights, activations, and/or partial sums, which can potentially allow for more data reuse
and thus reduce the amount of off-chip data movement (and on-chip data movement
between the different levels of the memory hierarchy).


Figure 7.1: The area and energy cost for additions and multiplications at different precisions, and
memory accesses in a 45 nm process. The area and energy scale differently for multiplication and
addition. The energy consumption of data movement (red) is significantly higher than that of
arithmetic operations (blue). (Figure adapted from [121].)


Finally, it reduces the cost of the MAC hardware. The cost of multiplication and accumulation
scales differently with bit width. The area and energy cost of a multiplier scales quadratically,
O(n^2), with bit width (n). The delay of the critical path of a multiplier typically scales linearly,
O(n), with bit width; therefore, reducing the number of bits per input operand (i.e., weights and
activations) can also help increase throughput. The energy and area cost of an accumulator
(adder) scales linearly, O(n), with bit width, while the delay of the critical path can scale either
linearly, O(n), or with the square root, O(√n), depending on the implementation [195, 196].
Figure 7.1 shows the energy and area cost for multipliers and adders at different precisions. From
this figure, we observe that multiplications are more expensive than additions, and reading from
memory (data movement) is more expensive than both; therefore, reducing data movement is a
very important benefit of reducing precision.
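The scaling rules above can be turned into a quick back-of-the-envelope estimate. The sketch below
assumes the idealized asymptotic model only (multiplier area and energy proportional to n^2, adder
area and energy and multiplier delay proportional to n); it does not reproduce the measured values
in Figure 7.1, and the function name is illustrative.

```python
def relative_cost(bits_from, bits_to):
    """Idealized cost ratios when reducing operand width from bits_from to bits_to.

    Assumes the asymptotic model only: multiplier area/energy ~ n^2,
    adder area/energy ~ n, multiplier critical-path delay ~ n.
    """
    ratio = bits_from / bits_to
    return {
        "multiplier_area_energy": ratio ** 2,  # quadratic in bit width
        "adder_area_energy": ratio,            # linear in bit width
        "multiplier_delay": ratio,             # linear in bit width
    }


# Going from 32-bit to 8-bit operands under this model.
savings = relative_cost(32, 8)
print(savings)
```

Under this model, moving from 32-bit to 8-bit operands shrinks the multiplier by a factor of 16 but
the adder by only a factor of 4, which is why the multiplier dominates the arithmetic savings.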

Figure 7.2: Various bit widths in a multiply and accumulate (MAC).

Note that the bit widths of integer multiplication and accumulation within a MAC differ from
the bit widths used in storage, as shown in Figure 7.2. The bit width of the multiplier is dictated
by the bit width of the filter weights (n_f) and input activations (n_i). To be correct to the last
bit, the bit width of the output (product) of the multiplier must be n_f + n_i. These products
are then accumulated across the size of the filters in the DNN model (i.e., C × R × S). To be
correct to the last bit, the bit width of the accumulator that generates the partial sum must be
n_f + n_i + log2(C × R × S), where C × R × S is the maximum number of weights in a filter
for the DNN model.¹ After the partial sums are fully accumulated, they are typically reduced to
the bit width of the activations (n_i) as the accumulated output will eventually become an input
activation to the next layer. Reducing the bit width of the fully accumulated partial sum will
not have a significant impact on accuracy if the distributions of the weights and activations are
centered near zero with a limited standard deviation; batch normalization may help achieve this
effect. Since the bit width of the final output activation will ultimately be less than the maximum
possible partial sum, some implementations (e.g., processing-in-memory accelerators discussed
in Chapter 10) will use reduced bit width computation (i.e., less than n_f + n_i + log2(C × R ×
S)) during the accumulation to reduce cost.
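As a worked example of the bit widths above, the following sketch computes the product and
accumulator widths needed to be correct to the last bit. Rounding log2(C × R × S) up to a whole
number of bits is an assumption made here, and the layer shape is a hypothetical one chosen for
illustration.

```python
import math


def mac_bit_widths(n_f, n_i, C, R, S):
    """Return (product_bits, accumulator_bits) for exact integer accumulation.

    product_bits     = n_f + n_i
    accumulator_bits = n_f + n_i + ceil(log2(C * R * S))
    The ceiling is an assumption: a fractional log2 is rounded up to whole bits.
    """
    product_bits = n_f + n_i
    growth_bits = math.ceil(math.log2(C * R * S))
    return product_bits, product_bits + growth_bits


# Hypothetical layer: 8-bit weights and activations, 256 input channels, 3x3 filters.
product_bits, accumulator_bits = mac_bit_widths(n_f=8, n_i=8, C=256, R=3, S=3)
print(product_bits, accumulator_bits)
```

For this shape, C × R × S = 2304, so 12 extra bits are needed on top of the 16-bit product, giving
a 28-bit accumulator.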


7.2 DETERMINING THE BIT WIDTH


The previous section highlighted the potential benefits of reducing precision and how the cost of
a MAC operation scales with the bit width of the weights, activations, and partial sum. In this 
section, we will discuss the factors that affect how we determine the bit width for these various 
data types. 


7.2.1 QUANTIZATION 


The number of unique values determines the bit width of a given data type. The process of
mapping data from a large number of possible values (full precision) to a reduced set of values
(reduced precision) is referred to as quantization.

Design decisions for quantization include (1) the number of values (often referred to as
the number of quantization levels) that should be represented at reduced precision (which affects
the bit width); and (2) the actual values (often referred to as the quantized values) that should
be represented at the quantization levels (which is affected by the distribution of the data being
quantized), and the relationship between these values (which affects how computation is
performed on the quantized values).

¹ For the popular CNN models described in Chapter 2, log2(C × R × S) is typically on the order of 10 to 16 bits.

Figure 7.3: Example of quantization (adapted from [137]): The input data x lies between 0 and
16. Using the quantization function Q(·) shown in (a), x is quantized to x̂ by mapping it to one
of L = 4 possible quantized values (q_i). In this example, Q(·) performs uniform quantization,
which means that the quantized values are equally spaced such that the four possible values
that x̂ can take on are {2, 6, 10, 14}. Decision boundaries d_i are used to decide the quantized
value that x should be mapped to. The quantization error is computed as the mean squared error
between x and x̂, i.e., E[(x − x̂)^2]. For the example sequence in (b), the quantization error is
0.625.

The typical design goal for quantization is to reduce the number of values while at the
same time reducing the quantization error, which is a measure of the average difference between
the original full precision representation and the quantized reduced precision representation. In
the context of DNNs, the quantization error can be used to estimate the impact on accuracy,
although the ultimate goal is to reduce the impact of quantization on the outputs (i.e., output
activations via the partial sums), rather than the inputs (i.e., input activations or weights).

The quantization method and quantization error can be more precisely defined as follows:

Let x denote the data at the original full precision. Quantization involves mapping data,
x, to a smaller set of quantization levels, q_i, where 0 ≤ i ≤ L − 1 and L is the number of levels;
x̂ denotes the quantized value of x as defined by x̂ = Q(x), where Q(·) is a function that defines
the quantization method as Q(x) = q_i for d_i < x ≤ d_{i+1}, where d_i are the decision boundaries
in the function. Thus, Q(·) defines how x is mapped to q_i. An example of quantization is shown
in Figure 7.3.
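The definition above can be sketched in a few lines for the uniform case. The code below builds a
uniform Q(·) for data between 0 and 16 with L = 4 levels, which yields the quantized values
{2, 6, 10, 14} described for Figure 7.3. The input sequence is a made-up one chosen for easy
arithmetic (it is not the sequence from the figure), and clipping of out-of-range inputs is an
assumption.

```python
import math


def uniform_quantizer(x_min, x_max, num_levels):
    """Build a uniform Q(x): equally spaced levels q_i with boundaries d_i."""
    step = (x_max - x_min) / num_levels
    levels = [x_min + (i + 0.5) * step for i in range(num_levels)]   # q_i
    boundaries = [x_min + i * step for i in range(num_levels + 1)]   # d_i

    def Q(x):
        # Pick i such that d_i < x <= d_{i+1}; clip values outside [x_min, x_max].
        i = math.ceil((x - x_min) / step) - 1
        i = max(0, min(num_levels - 1, i))
        return levels[i]

    return levels, boundaries, Q


def quantization_error(xs, Q):
    """Mean squared error E[(x - Q(x))^2] over a sequence of samples."""
    return sum((x - Q(x)) ** 2 for x in xs) / len(xs)


levels, boundaries, Q = uniform_quantizer(0, 16, 4)
sequence = [1, 3, 5, 9, 15]  # hypothetical inputs, each exactly 1 away from its level
print(levels, quantization_error(sequence, Q))
```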

Figure 7.4: Quantization when the distribution of x is not uniform. In this example, x has a
Gaussian distribution (p_x(x_0)). Accordingly, uniform quantization with L = 4 levels results in a
quantization error of 0.5477, whereas non-uniform quantization with L = 4 levels results in a
quantization error of 0.1467.

The quantization error is measured between the quantized data and the original data. This
error depends on the distribution of the data, the quantized values (q_i), and the decision
boundaries (d_i). More formally, given that x_min ≤ x ≤ x_max and the probability density function
for x is p_x(x_0), the optimal q_i and d_i that minimize the quantization error can be determined by
solving

\[
\min_{q_i,\, d_i} \; E\big[(x - \hat{x})^2\big]
  = \int_{x_0 = x_{\min}}^{x_{\max}} (\hat{x}_0 - x_0)^2 \, p_x(x_0) \, dx_0,
  \qquad \hat{x}_0 = Q(x_0). \tag{7.1}
\]


Therefore, the optimal quantization function Q(·) that minimizes the quantization error
depends on the distribution of x. While uniform quantization, which has equal spacing between
the quantized values (q_i), can be operated on directly by standard ALU hardware, it is only
optimal if x has a uniform distribution (i.e., all possible values of x have equal probability).
Designing a Q(·) that better matches the distribution of the original data x can either reduce the
quantization error (see Figure 7.4), or maintain the same quantization error with fewer
quantization levels, which reduces the bit width of x.

As it turns out, the distributions of the weights and activations in DNN models are not 
uniform [197, 198]. Accordingly, a popular approach to reduce the number of quantization levels 
is to use non-uniform quantization, where the spacing between levels varies. 
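To make the benefit of non-uniform levels concrete, the sketch below compares uniform levels with
levels fitted to Gaussian samples. It uses Lloyd's algorithm (1-D k-means) as one possible way to
approximate the minimization in Equation (7.1) on sampled data; this is an illustrative choice, not
necessarily the procedure behind Figure 7.4, and the sample size, seed, and iteration count are
arbitrary.

```python
import random


def nearest_level(x, levels):
    """Quantize x to the closest level (boundaries are midpoints between levels)."""
    return min(levels, key=lambda q: abs(x - q))


def mse(xs, levels):
    """Empirical quantization error E[(x - x_hat)^2]."""
    return sum((x - nearest_level(x, levels)) ** 2 for x in xs) / len(xs)


def lloyd_max(xs, initial_levels, iterations=20):
    """Alternate nearest-level assignment and centroid update (1-D k-means)."""
    levels = list(initial_levels)
    for _ in range(iterations):
        clusters = [[] for _ in levels]
        for x in xs:
            i = min(range(len(levels)), key=lambda k: abs(x - levels[k]))
            clusters[i].append(x)
        # Each level moves to the mean of its samples; empty clusters keep their level.
        levels = [sum(c) / len(c) if c else q for c, q in zip(clusters, levels)]
    return levels


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]

L = 4
lo, hi = min(samples), max(samples)
step = (hi - lo) / L
uniform_levels = [lo + (i + 0.5) * step for i in range(L)]
learned_levels = lloyd_max(samples, uniform_levels)

uniform_error = mse(samples, uniform_levels)
learned_error = mse(samples, learned_levels)
print(uniform_error, learned_error)
```

Because each Lloyd iteration can only keep or lower the empirical error, the fitted levels end up
with a lower error than the uniform levels they start from, with the levels crowding toward the
center of the distribution where most samples lie.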

Figure 7.5: Various methods of quantization. (Figures adapted from [197, 199].)

The relationship between the quantized values can be “computable,” where the mapping
from the index (i) to the quantized value (q_i) can be implemented using simple computation
or logic. For instance, the quantized values can be assigned to be powers of two [199],
as shown in Figure 7.5b, where the mapping between the index and the quantized values can be
performed with a simple shift. For this mapping, an additional benefit is that any multiplication
with the quantized value can be replaced with a bit-shift [198, 200]. This is often referred to
as computing in the log domain or log quantization, and the resulting DNN model is often
referred to as “multiplier-free.”
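A minimal sketch of the shift-based multiplication follows. It assumes integer activations and
rounds each weight to the nearest power of two in the log domain; the rounding rule and the
function names are assumptions for illustration, and the cited works may quantize differently.

```python
import math


def quantize_to_power_of_two(w):
    """Approximate a non-zero weight as sign * 2**exponent (rounded in the log domain)."""
    if w == 0:
        raise ValueError("zero has no power-of-two representation")
    sign = -1 if w < 0 else 1
    exponent = round(math.log2(abs(w)))
    return sign, exponent


def shift_multiply(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using only shifts."""
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent  # right shift truncates low-order bits
    return sign * shifted


sign, exponent = quantize_to_power_of_two(7.3)   # 7.3 is approximated by 2**3 = 8
print(shift_multiply(5, sign, exponent))         # same result as 5 * 8
```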

Alternatively, the relationship between the quantized values can be “non-computable,”
where the mapping from the index (i) to the quantized value (q_i) is unconstrained and
thus requires the use of a hash or look up table. For instance, the quantized values (q_i) can be
learned from the data (Figure 7.5c), e.g., using k-means clustering to minimize the quantization
error, and thus can take on any value. However, any multiplication with the quantized value
becomes a three-step process (see Figure 7.6): (1) determine the index of the quantized value;
(2) perform a table look up [197] or compute a hash function [201] using the index to obtain
the quantized value; and (3) perform the multiplication on the quantized value.
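The three steps can be sketched as a dot product in which the weights are stored only as indices
into a small table. The codebook values, indices, and activations below are hypothetical and were
chosen to be exactly representable so the arithmetic is easy to follow.

```python
# Hypothetical learned codebook with L = 4 entries (2-bit indices).
codebook = [-0.75, -0.125, 0.25, 1.0]

# Weights are stored as indices into the codebook rather than as full precision values.
weight_indices = [3, 0, 2, 2, 1]
activations = [1.0, 2.0, -1.0, 0.5, 4.0]


def dot_with_codebook(indices, acts, table):
    """Accumulate weight * activation, decoding each weight through the table."""
    total = 0.0
    for index, activation in zip(indices, acts):  # step 1: read the stored index
        weight = table[index]                     # step 2: table look up
        total += weight * activation              # step 3: multiply (and accumulate)
    return total


print(dot_with_codebook(weight_indices, activations, codebook))
```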

Compared to uniform quantization or non-uniform quantization that uses computable
quantized values (e.g., log quantization), non-uniform quantization that uses non-computable
quantized values (e.g., learned quantization) can achieve a lower quantization error for a given
number of quantization levels, or require fewer quantization levels for a given quantization
error. However, this comes at a cost of increased complexity when performing computations on
the quantized values. Therefore, selecting a method of quantization often involves a trade-off
between implementation complexity, the number of quantization levels, and the quantization
error (which can impact the accuracy of the DNNs).

Figure 7.6: An example of performing a multiplication, where the weights are quantized using
learned quantization.

Regardless of the form of quantization, an additional benefit of reducing the number of
unique values is that it effectively increases redundancy in the data, which can make approaches
that exploit sparsity (as described in Chapter 8) more effective. In fact, quantization of the
weights can also be thought of as a form of weight sharing as it reduces the number of unique
weights. This can reduce the memory required to store the weights from CRSM × full precision
to CRSM × log2 L, as shown in Figure 7.6. In this example, if non-uniform quantization with
non-computable quantized values is applied, the hardware overhead is an additional look up
table of size U × full precision, where U is the number of unique quantized values (U = L here).
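The storage arithmetic above is easy to check numerically. The sketch below assumes a hypothetical
layer shape and 32-bit full precision, and rounds log2 L up to whole bits per index.

```python
import math


def weight_storage_bits(C, R, S, M, full_precision_bits, num_levels):
    """Storage for a layer's weights with and without weight sharing, in bits."""
    num_weights = C * R * S * M
    index_bits = math.ceil(math.log2(num_levels))  # bits per stored index
    return {
        "full_precision": num_weights * full_precision_bits,
        "indices": num_weights * index_bits,
        "lookup_table": num_levels * full_precision_bits,  # one entry per level
    }


# Hypothetical layer: 64 input channels, 3x3 filters, 128 filters, 16 quantization levels.
storage = weight_storage_bits(C=64, R=3, S=3, M=128, full_precision_bits=32, num_levels=16)
print(storage)
```

In this example the indices take one eighth of the original storage, and the look up table adds
only 512 bits, which is negligible next to the weights themselves.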

The previous example illustrates a cost of doing arithmetic operations on quantized values.
These costs vary depending upon the type of quantization. Values that undergo uniform
quantization can be directly computed upon using conventional arithmetic hardware. Values
that undergo non-uniform quantization typically need additional hardware to convert back to
the uniform domain for computation. For non-uniform quantization that uses non-computable
quantized values, a look up as in Figure 7.6 is typically used for the conversion. However, for
non-uniform quantization that uses computable quantized values, arithmetic operations can
sometimes be performed directly in the quantized domain such as the log domain; other times,
a computation is required to convert back to the uniform domain.

Another important characteristic of the data distribution is the range of the data, which is the
ratio of the largest to the smallest non-zero magnitude (i.e., x_max/x_min). A wide range may
result in higher quantization error since the quantized values either need to be spread out to cover
the wide range or overflow/underflow will occur (i.e., quantization will clip at the maximum and
minimum values). While increasing the number of quantization levels can help reduce the error,
it may be more advantageous to address a wide range by introducing the notion of a scale factor.

Table 7.1 shows an example of both uniform and scale factor quantization for L = 16
levels. Using a scale factor can be thought of as a form of non-uniform quantization, where the
quantization is a function of the magnitude of the original full precision value. The benefit of
using such quantization is illustrated in Figure 7.7, where the quantization error for multiplication
of values is much lower for the scale factor quantization than uniform quantization.

Table 7.1: Example values for uniform and scale factor-based quantization with L = 16
quantization levels. For uniform, the quantized values are q_i = 44,000 × i/16. For scale factor
quantization, the quantized values are uniform within each scale as computed with
q_i = (4'/2+1 − 4'/) × (1%2)/8 + 44/2 − 1.


7.2.2 STANDARD COMPONENTS OF THE BIT WIDTH 


Once we have determined the quantized values that we want to represent at reduced precision, 
we need to map them to a numerical representation in the form of bits. This can be done using 
the standard format for numerical representations, where the bit width is determined based on 
several factors: 


• The range of the values that are represented. As discussed in Section 7.2.1, adding a scale
factor can be used to better represent data with a large range. As a result, representing
values with a large range (e.g., 10^−38 to 10^38) often requires more bits to support the scale
factor as compared to values with a small range (e.g., 0 to 127). For instance, n_e bits can be
used to scale values by a factor of 2^(e−127), where e is the value represented by the n_e bits.
Note that the exponent (e − 127) can be either positive or negative, where a negative exponent
allows representation of small values. In standard numerical representations that employ
scaling, these bits are often referred to as the exponent.

• The number of unique values that are represented for each scaling factor. As we increase
the number of unique values, we will need more bits to index the values. For instance,
n_m bits can be used to represent 2^(n_m) values.² In standard numerical representations (e.g.,
IEEE Standard for Floating-Point Arithmetic (IEEE 754)), these bits are referred to as
the mantissa.

• Whether the values are signed or unsigned. Supporting signed values requires one extra
bit (n_s = 1) compared to unsigned values. Note that negative numbers are often represented
in two’s complement, where the mantissa also changes (but n_m does not change)
when representing a negative number.

² Using the example from Section 7.2.1, L levels will require ⌈log2 L⌉ bits.

Figure 7.7: Comparison of the quantization error of scale factor quantization versus uniform
quantization. (a) Quantization applied to inputs and output of multiplier. (b) Average
quantization error (E[(z − ẑ)^2]) computed at output of multiplier across all multiplications from
0 × 0 to max input × max input.

Figure 7.8: Common numerical representations. (Figure adapted from [202].)


Accordingly, the total number of bits required for common numerical representations
is n_s + n_e + n_m. Figure 7.8 shows examples of common numerical representations and their
ranges.


Standard Numerical Representations 

To ensure compatibility between different computation systems, there are standards (e.g., IEEE
Standard for Floating-Point Arithmetic (IEEE 754)) that specify the arithmetic formats of the
numerical representations (including the number of bits allocated for the mantissa (n_m), the
exponent (n_e), and the sign (n_s)), and methods to perform arithmetic on these formats. These
formats are widely used in general computing applications and are supported on general-purpose
compute processors such as CPUs and GPUs. As such, much of the earlier work in DNN processing
used these formats, and they tend to serve as a common baseline for comparison in reduced
precision DNN research. We will briefly describe some of these formats.

Fixed point format (also referred to as integer (int) format) has a fixed range and does not
require any n_e bits. Figure 7.9b shows how an 8-bit fixed point (int8) value can be represented by
(−1)^s × m, where s is the sign bit and m is the value of the n_m = 7-bit mantissa, which covers
the range of 0 to 127.

Floating point (FP) format refers to the case where the range can be changed for each value,
and thus includes n_e bits.³ Figure 7.9a shows how 32-bit floating point is represented by (−1)^s ×
m × 2^(e−127), where s is the sign bit, e is the value of the n_e = 8-bit exponent, and m is the value
of the n_m = 23-bit mantissa, and covers the range of 10^−38 to 10^38.
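The field layout can be inspected directly with Python's standard `struct` module. The sketch
below splits a 32-bit float into its sign, exponent, and mantissa fields and rebuilds the value.
It assumes a normal (non-zero, non-subnormal, finite) number, for which IEEE 754 prepends an
implicit leading 1 to the 23 stored mantissa bits, so that m = 1 + fraction / 2^23.

```python
import struct


def decode_fp32(x):
    """Return the (sign, exponent, fraction) bit fields of x stored as IEEE 754 fp32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # n_s = 1 bit
    exponent = (bits >> 23) & 0xFF    # n_e = 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # n_m = 23 stored mantissa bits
    return sign, exponent, fraction


def reconstruct_normal(sign, exponent, fraction):
    """Rebuild (-1)^s * m * 2^(e - 127) for a normal number (implicit leading 1)."""
    m = 1.0 + fraction / 2 ** 23
    return (-1) ** sign * m * 2.0 ** (exponent - 127)


fields = decode_fp32(-6.5)            # -6.5 = -1.625 * 2^2
print(fields, reconstruct_normal(*fields))
```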


³ The name is derived from the fact that changing the scale can be thought of as moving the decimal point for each value.

Figure 7.9: Example of different numerical representations. Panel (b): 8-bit fixed point example.


Figure 7.10 shows how additional hardware can be added on top of a fixed point multiplier
to make it a floating point multiplier; this explains the increase in area and energy for floating
point multiplication that was observed in Figure 7.1. Various works have explored reducing the
overhead of the additional hardware to support floating point (e.g., adders for the exponent
update, and adders, shifters, and control logic for the normalizer) by sharing the exponent (i.e.,
scale factor) across multiple variables; this is often referred to as block floating point [203] or
dynamic fixed point [204] as it can be viewed as a format that lies between floating point and fixed
point. For instance, Microsoft’s Brainwave Project shares the exponent across a 128-element
vector [168], while Intel Nervana Systems’ Flexpoint shares the exponent across all elements
within a tensor (i.e., all weights in a filter, or all activations in a feature map) [205]. The average
number of bits for a block floating point representation can also be reduced relative to
conventional floating point, since the n_e bits are shared across multiple variables.
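The idea of sharing one exponent across a block can be sketched as follows. This is a simplified
illustration under stated assumptions (one shared power-of-two scale per block, signed integer
mantissas, round-to-nearest with clamping); it does not reproduce the specific formats used by
Brainwave or Flexpoint, and the function names are hypothetical.

```python
import math


def block_fp_quantize(values, mantissa_bits=8):
    """Quantize a block to signed integer mantissas that share a single exponent."""
    max_magnitude = max(abs(v) for v in values)
    if max_magnitude == 0:
        return [0] * len(values), 0
    # frexp gives max_magnitude = f * 2**e with 0.5 <= f < 1, so the largest value
    # fits in a signed mantissa once scaled by 2**(mantissa_bits - 1 - e).
    _, e = math.frexp(max_magnitude)
    shared_exponent = e - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    scale = 2.0 ** shared_exponent
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exponent


def block_fp_dequantize(mantissas, shared_exponent):
    """Recover approximate values: mantissa * 2**shared_exponent."""
    return [m * 2.0 ** shared_exponent for m in mantissas]


block = [0.5, -1.25, 3.0, 0.01]
mantissas, shared_exponent = block_fp_quantize(block)
print(mantissas, shared_exponent, block_fp_dequantize(mantissas, shared_exponent))
```

Note how the smallest element (0.01) is rounded to zero: values much smaller than the block
maximum lose precision, which is the price of storing only one exponent per block.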

The default precision used on CPUs and GPUs is 32-bit floating point (fp32) (also referred
to as single precision), with n_s = 1, n_e = 8, n_m = 23. In general-purpose platforms such
as CPUs and GPUs, the main benefit of reducing precision is an increase in throughput; for
instance, for the same memory bandwidth and within the same clock cycle, four 8-bit operations
can be performed instead of one 32-bit operation. Accordingly, several commercial products
that target deep learning have added support for reduced precision. This includes Nvidia’s
Pascal [206] and Google’s TPU [207], which announced support for 8-bit fixed point for inference
in 2016. In the following years, several products, including Nvidia’s Volta, Google’s TPUv2, and
Intel’s NNP-L, announced support for 16-bit floating point for training. Accordingly, using 8-bit
fixed point for inference and 16-bit floating point for training has become common practice.
Note, however, that the above refers to the multiplier precision; in most work, the accumulation
remains at 32-bit precision.

Figure 7.10: Fixed point versus floating point multiplication: (a) fixed point multiplier; (b) binary
multiplication (figure adapted from [192]); (c) components of a floating point multiplier. The fixed
point multiplier in (a) performs the binary multiplication in (b). In the floating point multiplier,
the mantissa (m_A, m_B) dictates the number of bits in the fixed point multiplier, contained within
the floating point multiplier, highlighted in red. The exponent (e_A, e_B) dictates the number of
shifters in the exponent update.


Custom Numerical Representations 

In order to further reduce the number of bits required to represent a given value, there has been 
extensive research on custom numerical representations. In particular, many works have explored 
the trade-off between allocating bits to the mantissa (n_m) versus the exponent (n_e) to reduce
the overall number of bits required. 

For instance, 16-bit floating point (fp16) as defined by the IEEE floating point standard,
also commonly referred to as half precision, uses n_s = 1, n_e = 5, n_m = 10, resulting in a range
between ~5.9e−8 and ~6.5e4. In contrast, the recently proposed “brain floating-point format”
(bfloat16) [98] is also 16 bits, but distributes the bits as n_s = 1, n_e = 8, n_m = 7, resulting in a
range between ~1e−38 and ~3e38.⁴ In other words, bfloat16 trades off fewer unique values for
a larger range. Supporting the larger range is particularly useful for representing the gradients
during training. In comparison, gradients can fall outside the range of fp16, and thus using
n_e = 5 in fp16 (rather than n_e = 8 in bfloat16 and fp32) would require loss scaling [98].
Training with bfloat16 can achieve the same state-of-the-art accuracy as 32-bit floating point
(with the same number of iterations and without changing the hyperparameters) [208].
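Because bfloat16 keeps the sign and the 8-bit exponent of fp32 and only shortens the mantissa,
a conversion can be sketched by keeping the upper 16 bits of the fp32 encoding. The code below
uses plain truncation; hardware and libraries may instead round to nearest, so this is a sketch of
the simplest variant rather than a reference implementation.

```python
import struct


def fp32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the fp32 encoding: sign, 8-bit exponent, 7-bit mantissa."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16  # truncation; the low 16 mantissa bits are dropped


def bfloat16_bits_to_float(bits16):
    """Widen bfloat16 back to fp32 by padding the mantissa with zeros."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


# Every 8-bit integer survives the round trip, since 8 significant bits are available
# (7 stored mantissa bits plus the implicit leading 1).
int8_lossless = all(
    bfloat16_bits_to_float(fp32_to_bfloat16_bits(float(v))) == float(v)
    for v in range(-128, 128)
)
print(int8_lossless, hex(fp32_to_bfloat16_bits(1.0)))
```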

Another example of custom precision is “ms-fp8” and “ms-fp9” from Microsoft’s Brainwave
Project [209], which are 8- and 9-bit floating point formats with n_m = 2 and n_m = 3,
respectively; both have n_s = 1 and n_e = 5. “ms-fp8” and “ms-fp9” provide a larger range than
8-bit integer at a comparable area cost due to their smaller mantissa. This is particularly
important for Microsoft’s Brainwave Project, which aims to pack as many weights and MACs as
possible onto an FPGA, so that it can deliver low-latency, high-throughput DNN inference.⁵
In fact, the small mantissas result in narrow bit width multiplications that can efficiently map
to look up tables and DSPs (e.g., 5 narrow bit width multiplies can be mapped to the same 18 ×
19 multiplier in a DSP). It should be noted that custom numerical precisions are well suited for
FPGAs as the specialization can be applied during synthesis.


7.3 MIXED PRECISION: DIFFERENT PRECISION FOR
DIFFERENT DATA TYPES


The optimal method of quantization depends on the distribution of the data. Therefore, one way
to further reduce the required bit width (while maintaining accuracy) is to tailor the quantization
method to each of the different data types, which have different distributions. Using different
precision on different data types is commonly referred to as mixed precision.

For inference, the data types that need to be considered include weights, activations, and
partial sums. For training, the data types that need to be considered include weights, activations,
partial sums, gradients, and weight updates. Training typically requires higher precision than
inference since the gradients and weight updates have a larger range than the weights and
activations. This is the reason why inference can be performed with 8-bit fixed point while
training requires 16-bit floating point.⁶

Supporting mixed precision in custom hardware does not require any additional overhead
if the precision for each data type is fixed. However, if the precision varies for different regions
of the DNN model (e.g., different layers or different weights), then additional hardware support
may be required, which will be discussed in the next section.

⁴ The conversion between IEEE 32-bit floating point and bfloat16 is simple, as bfloat16 is simply fp32 with the mantissa
truncated from 23 to 7 bits. With n_m = 7, bfloat16 can also represent all 8-bit integers, which means int8 can be converted
to bfloat16 without loss.

⁵ The low latency requirement means that no batching is used, which reduces the weight reuse and thus increases the
off-chip memory bandwidth for weight reads. To address this, the weights are pinned (i.e., “hard coded” during synthesis)
onto the on-chip memory of the FPGAs to reduce off-chip bandwidth.

⁶ Sometimes for reduced precision training, replicas of the weights and activations will be stored at full precision, and be
used during back propagation, while in forward propagation the reduced precision version will be used [157, 205, 210, 211].

Figure 7.11: Example of a data-gated MAC. For simplicity, in this example only one input is
precision scalable (weights). (Figure adapted from [212].)


7.4 VARYING PRECISION: CHANGE PRECISION FOR 
DIFFERENT PARTS OF THE DNN 


Just as different data types have different distributions, their distributions can also vary across 
different parts of the DNN (e.g., layer, filter, channels). Therefore, to even more aggressively 
reduce the bit width, the quantization method can adapt to the varying distribution. Allowing 
the precision to vary across the different parts of the DNN model is commonly referred to as 
varying precision. 

Although some systems simply build separate MAC units per precision, varying precision
typically requires the use of a precision-scalable MAC in order to translate the reduced precision
into improvement in energy-efficiency or throughput without a significant increase in area. A
conventional approach is to use a data-gated MAC, where the unused logic (e.g., full adders) is
gated, as shown in Figure 7.11. This reduces unnecessary switching activity, and consequently
reduces energy consumption. The data-gated MAC can also be combined with voltage scaling
to exploit the shortened critical path for additional energy savings [155].

While a data-gated MAC is a simple approach, it leaves many idle gates without increasing
the throughput, making it inefficient in terms of throughput per area. Accordingly, there
have been many recent works that look at adding logic gates to increase the utilization of the full
adders for higher throughput per area. One of the key challenges is to reduce the overhead of the
additional logic gates, while at the same time efficiently mapping the multiplication workload
to the adders and maximizing their utilization. In most of these works, the focus of the scalability
exploration is on the multiplier since it consumes more area and energy than the accumulator.

There are many similarities between the design of a DNN accelerator and the design of a
precision-scalable MAC as highlighted in Camus et al. [212]. For instance, DNN accelerators 
with a spatial architecture contain multiple PEs within a PE array, while a spatial precision- 
scalable MAC contains multiple full adders within a spatial multiplier. In addition, the PEs in 
the PE array accumulate partial sums, while the full adders in the multiplier accumulate partial 
products (see Figure 7.10b). 

In the spatial precision-scalable MACs, the array of full adders within the multiplier is
regrouped to form multiple multipliers with reduced precision, as compared to one reduced
precision multiplier in the data-gated MAC case. For instance, an 8b × 8b multiplier can potentially
be converted into two 4b × 8b, four 4b × 4b, four 2b × 8b, eight 2b × 4b, or sixteen 2b ×
2b multipliers. Regrouping means that the partial products of the full adders belonging to the
same multiplier are accumulated together. The accumulation of the partial products can occur
temporally within a full adder, as shown in Figure 7.12a (e.g., [180, 213]). The temporal
accumulation of partial products approach can exploit data reuse at the inputs (low input bandwidth),
but generates many partially accumulated values in parallel (e.g., four 14-bit accumulator
registers in the rightmost sub-figure of Figure 7.12a, and high register update bandwidth). Alternatively,
the partial products can be accumulated spatially across different full adders, as shown in
Figure 7.12b (e.g., [214, 215]). The spatial accumulation of partial products approach requires unique
inputs (high input bandwidth), but generates only one partially accumulated value at a time (e.g.,
only one 14-bit accumulator register in the rightmost sub-figure of Figure 7.12b, and low register
update bandwidth).

For spatial precision-scalable MACs, the hardware overhead is due to (1) the additional 
adders and shift logic required to support parallel accumulations of partial products into sepa- 
rate final products and (2) the configuration logic around the adders to support accumulation 
across different groups of full adders. Additional output registers are also required for the “output 
stationary” approach. 

Figure 7.12: Examples of spatial precision-scalable MACs: (a) temporal accumulation of partial
products; (b) spatial accumulation of partial products. For simplicity, in this example only
one input is precision scalable (weights). Note that the designs with spatial accumulation take
more inputs but keep all units busy and do not partition the final adder. (Figures adapted
from [212].)

Figure 7.13: Example of temporal precision-scalable MACs. For simplicity, in this example only
one input is precision scalable (weights). (Figure adapted from [212].)

There are also similarities between DNN accelerators with a temporal architecture and
temporal precision-scalable MACs. DNN accelerators with a temporal architecture generate
and temporally accumulate multiple partial sums with a single PE, while temporal
precision-scalable MACs generate and temporally accumulate multiple partial products for a multiplier
with a single full adder and shifter, as shown in Figure 7.13. Accordingly, the number of cycles
required to complete the multiplication is proportional to the number of bits; in other words, the
multiplier scales temporally with precision. This is often referred to as bit-serial processing since
the multiplication is computed bit-by-bit. To combat the fact that a bit-serial multiplier takes
multiple cycles (rather than a single cycle in a conventional multiplier), the bit-serial multiplier
can be clocked at a higher clock frequency, since a single full adder and shifter has a shorter
critical path compared to a spatial multiplier.⁷ Furthermore, bit-serial multipliers have
a smaller area than spatial multipliers, so several of them can be used in parallel to increase
throughput [96, 216, 217].

Bit-serial multipliers also provide an opportunity for exploiting sparsity at the bit level.⁸
Only the non-zero bits in the operand(s) require a computational cycle; the rest can be
skipped [218]. At the extreme, the operand only has one non-zero bit, which makes it a
multiplication by a power of two that can be implemented with just a shift (i.e., “multiplier-free”), as
previously described in the context of log domain quantization.
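The behavior of a bit-serial multiplier, with and without skipping of zero bits, can be modeled in
software as a shift-and-add loop. The sketch below assumes an unsigned weight and counts one
"cycle" per processed weight bit; it is a functional model only and ignores the hardware needed to
locate the next non-zero bit.

```python
def bit_serial_multiply(activation, weight, weight_bits, skip_zero_bits=False):
    """Shift-and-add multiply that consumes one (unsigned) weight bit per cycle.

    Returns (product, cycles). With skip_zero_bits=True, zero weight bits are
    assumed to cost no cycle, which models bit-level sparsity.
    """
    accumulator = 0
    cycles = 0
    for bit_position in range(weight_bits):
        bit = (weight >> bit_position) & 1
        if skip_zero_bits and bit == 0:
            continue                                   # skipped: no cycle spent
        cycles += 1
        if bit:
            accumulator += activation << bit_position  # add the shifted partial product
    return accumulator, cycles


print(bit_serial_multiply(13, 0b0101, 4))                       # all 4 bits processed
print(bit_serial_multiply(13, 0b0101, 4, skip_zero_bits=True))  # only the 2 non-zero bits
print(bit_serial_multiply(13, 0b1000, 4, skip_zero_bits=True))  # one non-zero bit: a pure shift
```

The cycle count tracks the precision of the weight in the first case and the number of non-zero
bits in the second, which is the property that bit-level sparsity exploits.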

Camus et al. [212] evaluate the energy, area, bandwidth, and throughput of many variations
of precision-scalable MAC architectures, including whether one or both inputs are scalable
(i.e., 1D or 2D scalable) and multi-bit serial designs. They show that the hardware overhead to
support precision-scalable MACs beyond the simple data-gated MAC can be quite substantial
and can significantly reduce the benefits; this is a clear example of a trade-off between flexibility
and efficiency. The overhead can be reduced by amortizing it across multiple MACs [219].


⁷ However, this may also increase the clock tree power.
⁸ In Chapter 8, we will discuss exploiting sparsity at the operand (value) level (i.e., in the weights and activations).


7.5 BINARY NETS


At the most extreme, there have been works looking at reducing the number of bits for weights
and/or activations down to a single bit (i.e., limit each input operand to only two unique values);
these types of DNNs are often referred to as binary nets. In addition to reducing the memory
requirements, binary nets also significantly reduce the cost of the computation. If one of the
operands is binary, then the multiplication turns into an addition. If both of the operands are
binary, the multiplication operation can be turned into an XNOR, and multiple multiplications
can be performed with a bit-wise XNOR. Following the bit-wise operation, the accumulation
is performed with a popcount.⁹ Binary nets are widely used in conjunction with processing-in-memory
accelerators, discussed in Chapter 10, where the compute is performed using the storage
element, which typically stores a binary weight.
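
The XNOR-and-popcount computation can be sketched as follows. This is a minimal illustration, assuming that the two values of each operand are +1 and -1 and that they are encoded as bit 1 and bit 0, respectively; the helper names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an integer (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word: int, w_word: int, n: int) -> int:
    """Dot product of two n-element +1/-1 vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_word ^ w_word) & mask     # bit is 1 where the operands match
    matches = bin(xnor).count("1")       # popcount
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - n


activations = [1, -1, 1, 1]
weights = [1, 1, -1, 1]
reference = sum(a * w for a, w in zip(activations, weights))
result = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
print(result, reference)   # 0 0
```

Under this encoding, one bit-wise XNOR replaces n multiplications, and the popcount followed by a scale and offset replaces the accumulation.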

One of the main drawbacks of binary nets is their impact on accuracy. Directly using binary
weights and activations results in a 29.8 percentage point degradation in accuracy when applied
to AlexNet for image classification on the ImageNet dataset [220, 221]; with such a large drop 
in accuracy, one might ask if traditional handcrafted approaches might offer a better accuracy 
versus energy trade-off [48]. 

Accordingly, binary nets are often combined with many of the previously discussed tech- 
niques to recover the accuracy. For instance, the first and last layers in the DNN are often kept at 
full precision (32-bit float), which is a form of varying precision that requires flexible hardware. 
Another example is XNOR-Net [221], which uses a different scale factor for the weights at 
each layer, and a different scale factor for the activations at each spatial location (i.e., H × W)
in the feature map; this form of non-uniform quantization reduces the degradation in accuracy
to 11 percentage points for AlexNet on ImageNet.

While binary nets limit the weights and/or activations to two values, there may be benefits 
to allowing for a third value, specifically, the value of zero. Although this requires an additional 
bit per operand, the sparsity of the operand can be exploited to reduce computation and storage 
cost as discussed in Chapter 8, which can potentially cancel out the cost of the additional bit. 
DNNs that allow two unique values plus zero are often referred to as ternary nets [222].

Recent works have also explored mixed precision approaches, where the weights and 
activations use different forms of quantization, to achieve higher accuracy. For instance, 
QNN [210], DoReFa-Net [211], and HWGQ [223] use 1 bit with a scale factor for weights,
and 2 bits for activations, with various forms of quantization.

Using binary/ternary weights offers additional opportunities for optimization. For instance,
there tends to be more redundancy between filters since the number of unique weights
is reduced to two (binary) or three (ternary). This can be exploited by reducing memory access 
or by transforming the filters to further increase sparsity as explored in Yin et al. [224]; more 
discussion on exploiting weight sparsity can be found in Section 8.1.2. 

Hardware implementations for binary/ternary nets have been explored in recent publications.
YodaNN [225] and Yin et al. [224] use binary weights, while BRein [167] uses binary
weights and activations. Binary weights are widely used in the processing-in-memory accelerators
discussed in Section 10.2. Hardware implementations of these low-precision networks have
also been explored in FPGAs, where if binarizing the network sufficiently reduces the number
of weights, then all the weights in the DNN can be “hard coded” via synthesis and stored on
a single FPGA [226]. Finally, the nominally spike-inspired TrueNorth chip can implement a
reduced precision neural network with binary activations and ternary weights using TrueNorth’s
quantized weight table [14]. Most of the hardware implementations for binary/ternary nets are
demonstrated on small DNN models such as LeNet, with a few on larger DNN models such as
AlexNet [224, 225].


⁹ Popcount (short for population count) is a method to count the number of non-zero binary values in a vector. It is also
often referred to as binary accumulation. Many CPU architectures today have a dedicated instruction for popcount.


7.6 INTERPLAY BETWEEN PRECISION AND OTHER 
DESIGN CHOICES 


Up to this point, we have discussed many design choices related to reduced precision both from 
the algorithm and hardware perspective. Exploring reduced precision does not have to happen in 
isolation. In fact, it can be combined with other design decisions both at the algorithm and hard- 
ware levels for an even larger design space. This can often lead to improved trade-offs between 
efficiency and accuracy. 

Reduced precision can be explored in conjunction with the shape of the DNN model. For
instance, the accuracy loss due to reduced precision can potentially be recovered by increasing the
number of filters (i.e., widening the network) as explored in Wide Reduced-Precision Networks
(WRPN) [227].

Reduced precision can also be accounted for in the design of the hardware dataflows discussed
in Chapter 5. For instance, if the weights are only 1 bit but the activations use a higher
precision of 16 bits, an input-stationary dataflow may be more favorable than a weight-stationary
dataflow, as demonstrated in UNPU [96].


7.7 SUMMARY OF DESIGN CONSIDERATIONS FOR
REDUCING PRECISION 


First and foremost, it is important to consider and carefully evaluate the impact of reduced precision
techniques on accuracy. This must include factors such as the difficulty of the dataset, task,
and DNN model, as previously discussed in Section 3.1. For instance, while it might be possible to
reduce precision for an easy task (e.g., digit classification) without impacting accuracy, applying 
the same approach to a more difficult task may result in a significant drop in accuracy. 

It is also critical to ensure that the hardware cost to support reduced precision does not
exceed the benefits. This is particularly true when considering variable precision, as it requires
additional hardware, as discussed in Section 7.4. Another important factor to consider is the
granularity of the variable precision, as more overhead is required to support fine-grained variability
(e.g., 2b, 4b, 8b) than coarser-grained variability (e.g., 4b, 8b); however, having finer granularity
can provide support for DNNs that require fewer bits.

Finally, when evaluating reduced precision approaches, it is important to compare to the
correct baseline. As mentioned in Section 7.2.2, 8-bit fixed point precision for inference and
16-bit floating point precision for training have already become common practice, and hardware
support for these precisions is readily available in commercial products; thus, using 32-bit
floating point is considered a weak baseline.


CHAPTER 8
Exploiting Sparsity


A salient characteristic of the data used in DNN computations is that it is (or can be made to 
be) sparse. By saying that the data is sparse, we are referring to the fact that there are many 
repeated values in the data. Much of the time the repeated value is zero, which is what we will 
assume unless explicitly noted. Thus, we will talk about the sparsity or density of the data as
the percentage of zeros or non-zeros, respectively, in the data. The existence of sparse data leads
broadly to two potential architectural benefits: (1) sparsity can reduce the footprint of the data,
which provides an opportunity to reduce storage requirements and data movement, because
sparse data is amenable to being compressed, as described in Section 8.2;¹ and (2) sparsity
presents an opportunity for a reduction in MAC operations, which
results from the fact that 0 × anything is 0. This can result in savings in energy, time, or
both. In Section 8.3, we will discuss how the dataflows for sparse data can translate sparsity into
improvements in energy-efficiency and throughput. However, first in Section 8.1 we discuss the 
origins and ways that one can increase sparsity in the data used in DNN computations. 


8.1 SOURCES OF SPARSITY 


Efficient processing of feature map activations becomes increasingly important as the size of 
the input to the DNN model grows (e.g., increased image resolution), while efficient processing 
of filter weights becomes increasingly important as the size of the DNN model grows (e.g., 
increased number of layers). 

This section will discuss various approaches that can exploit properties such as redundancy 
and correlation in the feature maps and filters to increase their activation sparsity (Section 8.1.1) 
and weight sparsity (Section 8.1.2), respectively. The requirements for these approaches may dif- 
fer as activation sparsity is often data dependent and not known a priori, while weight sparsity 
can be known a priori. As a result, methods to increase sparsity for weights can be performed of- 
fline (as opposed to during inference) and can be more computationally complex than methods 
applied to increase activation sparsity. For instance, increasing weight sparsity can be incorpo- 
rated into training. 


¹ Note: We use the words sparsity or density to refer to a statistical property of the data, while we use the words compressed
or uncompressed to describe the characteristics of a representation of the (typically sparse) data.

Figure 8.1: Sparsity in activations due to ReLU.


8.1.1 ACTIVATION SPARSITY 


Sparsity in the activations can come from several sources. The most obvious and commonly 
exploited source is from the use of ReLU as a nonlinearity. Other sources include exploiting 
correlation in the input data or the upsampling of the feature maps in an up-convolution layer. In
this section, we will discuss how design choices of the DNN model or additional processing of 
the input data or feature maps can reveal additional sparsity. 


Sparsity Due to ReLU 

As discussed in Section 2.3.3, ReLU is a popular form of nonlinearity used in DNNs that sets all 
negative values to zero, as shown in Figure 8.1a. As a result, the output activations of the feature 
maps after the ReLU have fewer non-zero values, i.e., are sparse; for instance, the feature maps
in AlexNet have a percentage of non-zero values, i.e., density, between 19% and 63%, as shown in
Figure 8.1b. This activation sparsity gives ReLU an advantage over other nonlinearities such as
sigmoid.

The activations can be made even more sparse by setting values below a certain threshold
to zero; this is often referred to as pruning and can also be applied to weights, as discussed in
Section 8.1.2. Activation pruning can be implemented by increasing the threshold in the ReLU
or reducing the bias in the filter. Such pruning of small-valued activations can be translated
into an additional 11% speedup [228] for image classification on ImageNet with little impact
on accuracy. Aggressively pruning more activations (i.e., increasing the threshold) can provide
additional throughput improvement at the cost of reduced accuracy, as shown in Figure 8.2.
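
To make the effect of the threshold concrete, the following sketch applies a ReLU with an adjustable threshold and reports the resulting density. The function names and the random input are assumptions for illustration; the densities it prints say nothing about any particular DNN model.

```python
import numpy as np


def relu_with_threshold(x, threshold=0.0):
    """Keep values strictly greater than `threshold`; set the rest to zero.

    With threshold=0.0 this is the ordinary ReLU.
    """
    return np.where(x > threshold, x, 0.0)


def density(x):
    """Fraction of non-zero values in x."""
    return np.count_nonzero(x) / x.size


# Hypothetical pre-activation values drawn from a standard normal distribution.
rng = np.random.default_rng(0)
pre_activations = rng.standard_normal((64, 64))
for threshold in (0.0, 0.5, 1.0):
    out = relu_with_threshold(pre_activations, threshold)
    print(f"threshold={threshold:.1f} density={density(out):.2f}")
```

Raising the threshold can only lower the density, which is the source of the additional speedup and of the accuracy loss discussed above.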

The fact that ReLU effectively discards negative output activations opens up the possibility
of using approximate computing to reduce the number of MAC operations. Specifically, we can
terminate the computation of the output activation value if we can predict early on that the
output will be negative. The main challenge in this approach is how early and how accurately
we can predict that the output will be negative. Early prediction allows us to terminate earlier
and avoid more operations, while accurate prediction helps reduce the impact on accuracy (i.e.,
an incorrect prediction may result in a drop in accuracy). At the same time, it is desirable to use
simple logic to perform the prediction, as it adds overhead in terms of energy and area.

Figure 8.2: As the threshold for pruning activations increases, the accuracy drops as the amount of
sparsity, and consequently speed, increases. (Figure adapted from [228].)

Various works have explored different methods to predict the value of the output activation
with the goal of giving an early and accurate prediction with minimum additional compute
overhead. For instance, PredictiveNet [229] and Song et al. [230] propose computing the most
significant bits (MSBs) of the partial sum to predict whether the output will be negative and only
compute the remaining bits (i.e., least significant bits (LSBs)) if the output is positive; this can
translate into improvements in energy-efficiency and/or throughput. This can be implemented
with a precision-scalable multiplier (e.g., bit-serial), or an $n_{MSB}$-bit multiplier that computes the
MSBs and additional hardware that is conditionally invoked to compute the remaining bits.
SnaPEA [231] proposes reordering the weights based on their sign and then terminating the
accumulation when the partial sum drops below a threshold, as shown in Figure 8.3. This is
feasible since the order of accumulation does not affect the final result.² It also relies on the fact that
the input to the filter will be zero or positive; this is true if the input activations were processed
with a ReLU in the previous layer, and the input data to the first layer is zero or positive (e.g.,
pixels in an image). Finally, the cost of sorting and reordering the weights can be amortized
across multiple inputs and can happen offline if the weights are known in advance.
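
The sign-based reordering idea can be sketched as follows. This is a simplified illustration in the spirit of the approach described above, not the implementation from [231]: it assumes a termination threshold of zero (so the result is exact), integer operands, and a hypothetical function name.

```python
def relu_dot_early_exit(weights, activations):
    """Compute ReLU(dot(weights, activations)) with sign-ordered early exit.

    Requires non-negative activations. Returns (output, macs_performed).
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    # Reordering (can be done offline): non-negative weights first.
    order = sorted(range(len(weights)), key=lambda i: weights[i] < 0)
    partial = 0
    macs = 0
    for i in order:
        w = weights[i]
        if w < 0 and partial <= 0:
            # All remaining terms are non-positive, so the sum cannot
            # become positive and the ReLU output is zero.
            break
        partial += w * activations[i]
        macs += 1
    return max(partial, 0), macs


print(relu_dot_early_exit([2, -3, 1, -4], [1, 2, 3, 1]))  # (0, 3)
print(relu_dot_early_exit([1, -1], [3, 1]))               # (2, 2)
```

In the first call, the last MAC is skipped because the partial sum is already negative and only negative weights remain.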


² This is not exactly true for floating point, but should be sufficiently accurate.

Figure 8.3: Example of how we can reduce the number of MAC operations if we can predict that
the output will be zero. (Figure adapted from [231].)


Correlation in Input Data 

Depending on the application, there may be correlation in the input data that can translate into
correlation in the activation values of the feature maps, both of which can be exploited. For
image and video data, this can manifest as spatial or temporal correlation,
where the values of neighboring pixels within the image, or pixels in consecutive frames, tend
to be similar, respectively. In fact, it is this type of correlation that is exploited by popular and
widely used image and video compression standards such as JPEG [232], H.264/AVC [233], 
or HEVC [234]. Correlation can also be found in other types of data, such as speech; however, 
in this section we will use image and video as driving examples. 

One approach to exploiting this correlation is to process the difference between the correlated
pixels (activations) rather than each pixel (activation) independently. If the pixels (activations)
are well correlated, then it is likely that the difference between them will be zero or close to
zero; thus we can compute a sparse difference map between correlated pixels (activations), where
the degree of sparsity depends on the degree of correlation between the pixels (activations).

We can examine this approach using the following toy example. Assume we would like
to multiply two pixels (activations), $a_1$ and $a_2$, with the weight $w$. We could compute these
products separately as follows:

\[
\begin{aligned}
y_1 &= a_1 \times w \\
y_2 &= a_2 \times w.
\end{aligned}
\tag{8.1}
\]

This would require two multiplications. Alternatively, we could compute the difference between
$a_1$ and $a_2$, denoted $\Delta_a = a_2 - a_1$, and compute

\[
\begin{aligned}
y_1 &= a_1 \times w \\
y_2 &= a_1 \times w + (a_2 - a_1) \times w = y_1 + \Delta_a \times w.
\end{aligned}
\tag{8.2}
\]

We can reuse $y_1$ in the computation of $y_2$. If $a_1$ and $a_2$ are the same, then $\Delta_a$ is zero, and the
above calculation requires one multiplication rather than two.
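
The toy example extends naturally to a row of activations that share the same weight. The sketch below is a minimal illustration with a hypothetical function name; it reuses the previous product and performs a multiplication only when the delta is non-zero.

```python
def scale_row_with_deltas(activations, w):
    """Compute y[i] = activations[i] * w by reusing y[i-1] and the delta.

    Returns (outputs, multiplications_performed).
    """
    outputs = []
    mults = 0
    prev_a = 0   # treat the value before the first activation as zero
    prev_y = 0
    for a in activations:
        delta = a - prev_a
        if delta == 0:
            y = prev_y                  # identical neighbor: no multiplication
        else:
            y = prev_y + delta * w      # y_i = y_{i-1} + delta * w
            mults += 1
        outputs.append(y)
        prev_a, prev_y = a, y
    return outputs, mults


print(scale_row_with_deltas([5, 5, 5, 6, 6], 3))  # ([15, 15, 15, 18, 18], 2)
```

For this row, two multiplications are performed instead of five, at the cost of storing the previous activation and the previous product.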

The main challenge in this type of approach is the computational overhead for generating
the difference map (i.e., $\Delta_a$). On the one hand, it would be desirable to minimize the effort
to compute the difference map; on the other hand, it would be desirable to find the pixels
(activations) that are the most correlated, and thus maximize the sparsity of the difference map.
Furthermore, additional storage is required to store the previous product (i.e., $y_1$ in the toy
example). For temporal correlation, this can be costly as it may require storing the entire previous
frame and/or all the intermediate feature maps of the previous frame.

(a) Activation values in feature map (b) Delta values in difference map

Figure 8.4: Delta values (i.e., $\Delta_a$) in the difference map (b) generated by exploiting correlation
between immediate horizontal spatial neighbors in the feature map (a). Notice that the delta values
tend to be either zero or close to zero as compared with the activations in the feature map. (Figure
from [235].)

A simple approach for computing the difference map is to compute the difference between 
immediate neighboring pixels. For instance, Diffy [235] generates a difference map based on the 
difference between immediate horizontal spatial neighbors within an image, as shown in Fig- 
ure 8.4. Another example is Riera et al. [236], which generates the difference map between pixels 
(activations) at the same spatial position in consecutive frames (feature maps) to exploit tempo- 
ral correlation, as shown in Figure 8.6b; these pixels can be thought of as immediate temporal 
neighbors. To increase the correlation between consecutive frames (and their feature maps), 
additional quantization can be applied to reduce the number of unique values, as discussed in 
Chapter 7; however, this may come at the cost of reduced accuracy. 

When considering temporal correlation, particularly in video, it can be beneficial to account
for the moving objects within the video (i.e., change in the location of objects across
frames). This can be represented by assigning motion vectors to different objects or
pixels in the video, which indicate the correlation between pixels across frames, as shown in
Figure 8.5. While using motion vectors can help identify highly correlated pixels, and thus enable
increased sparsity in the difference map, the amount of computation needed to generate these
motion vectors can be very expensive. One solution is to build specialized hardware to perform
the motion estimation necessary to generate the motion vectors, as proposed in EVA² [237].

Figure 8.5: (a) The motion vector indicates how pixels move between frame $t-1$ and frame $t$. It can
also be used to indicate which pixels (or activations) are temporally correlated between consecutive
frames. (b) Motion estimation computes the motion vector (mv) based on the shift of
pixels (or activations) within the frame. (c) Motion compensation warps (shifts) the pixels in frame
$t-1$ based on the motion vector (mv) to align them with frame $t$. The difference map can be
computed between the motion compensated frame $t-1$ and the original frame $t$.


Motion estimation is a popular form of computation used in video compression and image
processing, and thus the motion vectors might be freely available if one considers the
interaction of the DNN processing with other parts of the system. For instance, if the DNN
accelerator is part of a larger System-on-Chip (SoC), it may be possible to obtain the motion
vectors from other blocks in the system such as the Image Signal Processor (ISP), as proposed
in Euphrates [238]. Another option might be to consider the format of the incoming data. If
the incoming data is in compressed form (e.g., compressed video using H.264/AVC or HEVC),
which is usually the case for streaming or stored video, the motion vectors are already embedded
in the syntax elements of the compressed video itself and can be directly accessed at low cost, as
proposed in FAST [239].

There can also be advantages to processing the difference map even if the delta values (i.e.,
$\Delta_a$) are not zero, but close to zero. For instance, for the values close to zero, most of the
higher-order bits will be zero, and bit-serial processing, as discussed in Chapter 7, can be used to reduce
the cost of the MAC operations [235].
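
As a rough illustration of why small deltas help bit-serial processing, the sketch below compares the number of magnitude bits needed for a row of activations with the number needed for its deltas. The example values are made up, the sign bit is ignored, and using the magnitude bit width as a proxy for bit-serial cost is a simplifying assumption.

```python
def magnitude_bits(value: int) -> int:
    """Number of bits needed to represent |value| (sign ignored)."""
    return abs(value).bit_length()


def total_bits(values):
    """Sum of magnitude bit widths, used here as a proxy for bit-serial work."""
    return sum(magnitude_bits(v) for v in values)


# Hypothetical row of well-correlated 8-bit activations.
row = [200, 201, 203, 203, 198]
# Keep the first value as-is and store differences for the rest.
deltas = [row[0]] + [b - a for a, b in zip(row, row[1:])]

print(deltas)                                # [200, 1, 2, 0, -5]
print(total_bits(row), total_bits(deltas))   # 40 14
```

Only the first value still needs the full 8 bits; the remaining deltas need at most 3 magnitude bits each in this example.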

Finally, the approaches described in this section can be selectively applied to the input 
data and the feature maps, depending on the amount of correlation and thus sparsity, as well 
as the computation and storage costs. For instance, when processing images, exploiting spatial 
correlation can be applied to the input data and all the intermediate feature maps, as the amount 
of additional storage is limited to only the previous pixel or activation [235]. 

On the other hand, exploiting temporal correlation is more expensive in terms of storage 
cost. As previously discussed, the entire previous frame or feature map may need to be stored, 
and the number of feature maps to be stored can increase with the number of layers in the DNN 
model. Furthermore, the temporal correlation between feature maps often decreases as we go
deeper into the DNN model, as the feature maps become more different from the input image,
which is the source of the correlation.³

Accordingly, there tends to be more variation in temporal-correlation-based approaches,
as shown in Figure 8.6. For instance, EVA² [237] uses temporal correlation to generate the
difference map for the feature map of an intermediate DNN layer; the feature maps of the
subsequent DNN layers are directly processed (not their difference maps), as shown in Figure 8.6c.
Alternatively, in FAST [239], temporal correlation is selectively applied to different regions of
the frame; highly correlated regions exploit temporal correlation in the final output feature map
(i.e., the DNN outputs are copied from the previous frame and no DNN layer processing needs
to be performed), as shown in Figure 8.6d, while the other regions need to undergo full DNN
processing.


Up-Convolution Layers 

Another source of sparsity comes from the use of up-convolution layers, as previously discussed in
Chapter 2 [60].⁴ Up-convolution layers and their variants are often used in DNN models that
generate dense output predictions (e.g., assign a label to each pixel in an image) for applications such
as semantic segmentation [92, 240, 241], depth estimation [64, 242], super-resolution [243-245],
and style transfer [246]. Specifically, they are typically used in applications that
generate high-resolution outputs, such as generative adversarial networks (GANs) and image
segmentation, which differ from applications that generate a single value output, such as image
classification.


³ The spatial correlation within a feature map also decreases as we go deeper into the DNN model; however, the storage
cost for exploiting spatial correlation tends to be less than temporal, thus designs such as [235] apply this approach to all
intermediate feature maps.

⁴ Note that variants of the up-convolution layer with different types of upsampling include the deconvolution layer, sub-pixel or
fractional convolutional layer, transposed convolutional layer, and backward convolution layer [69].


(b) Exploit correlation of immediate temporal neighbors for all feature maps
(c) Exploit temporal correlation to skip early layers
(d) Exploit temporal correlation to skip all layers

Figure 8.6: Various ways to exploit temporal correlation. (a) Baseline where all feature maps for
sequential frames are processed separately, and no temporal correlation is exploited. (b) Exploit
correlation of immediate temporal neighbors for all feature maps [236]. The difference map
($\Delta$) is computed by directly subtracting the frame or feature map of time $t-1$ from that of time $t$. After
processing the difference map ($\Delta$) by the filters ($W_i$), the frame or feature map of time $t-1$ is
added back, which follows Equation (8.2). (c) Motion estimation
(ME) is used to obtain the motion vector (mv) between the frames at time $t-1$ and $t$.
Processing of the earlier layers is skipped for frame $t$. The final output ($\mathrm{output}^{t}$) is generated
by only applying the later layers (in this example, layer 2) to the motion compensated (MC)
intermediate feature map (in this example, the input feature map of layer 2, $\mathrm{ifmap}_2^{t-1}$). The
trade-off between how many layers are skipped and accuracy is explored in [237]. (d) Processing of all
layers is skipped for frame $t$. The final output ($\mathrm{output}^{t}$) is generated by only applying motion
compensation (MC) to the final output of frame $t-1$ ($\mathrm{output}^{t-1}$). In [238, 239], the motion
estimation (ME) can also be skipped, since the motion vector (mv) can be obtained directly
from the data or another part of the system.

Figure 8.7: Example upsampling methods in decoder layer. (Figures adapted from [64].)

To generate the dense output, the up-convolution layers and their variants increase the size of
the feature map by producing an output feature map whose spatial dimensions (i.e., P and Q in
the DNN model dimensions) are larger than those of its input feature map (i.e., H and W in the DNN
model dimensions). This is in contrast to approaches such as strides and pooling that reduce
the size of the feature map. Increasing the size of the feature map can be done by upsampling
the input feature map before processing it with the filter weights, as discussed in Section 2.3.4.
Common forms of upsampling used in up-convolution layers and their variants include inserting
zeros between the activations, as shown in Figure 8.7a, interpolation using nearest neighbors, as
shown in Figure 8.7b, and interpolation with bilinear or bicubic filtering.

Upsampling introduces sparsity and correlation into the input feature map that is structured
and known a priori. For zero-insertion, the sparsity comes from the zeros inserted
between rows and columns of input values, leading to around 75% sparsity (e.g., when upsampling
by a factor of two in each dimension). For nearest-neighbors
upsampling, correlation comes from a pixel value being copied into adjacent pixel
locations, resulting in windows of pixels that are known to have identical values. Accordingly,
the cost of detecting sparsity or computing the difference map can be significantly less than for
the sources of sparsity described in the previous sections, which tend to be unstructured or
data-dependent. In fact, in some cases the input feature map and/or filter weights can be restructured
such that the input feature map can be processed in a dense form and avoid the overhead of sparse
processing altogether. For instance, the filter can be decomposed into a set of smaller filters
such that the input feature map is processed in a dense form, and the outputs of those smaller
filters can be interleaved to form the larger output feature map, as shown in Figure 8.8 [242];
this makes the processing of the decoder layer similar to the convolutional layer [69].
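
The filter decomposition can be checked on a small example. The sketch below is a one-dimensional simplification written with NumPy, assuming upsampling by a factor of two and a three-tap filter; the function names are hypothetical, and this is not the implementation from [242]. In one dimension zero-insertion gives 50% sparsity, rather than the roughly 75% of the two-dimensional case.

```python
import numpy as np


def zero_insert(x):
    """Upsample a 1-D signal by two by inserting a zero after each sample."""
    up = np.zeros(2 * len(x), dtype=x.dtype)
    up[::2] = x
    return up


def sparse_path(x, f):
    """Filter the zero-inserted signal directly (half of its inputs are zero)."""
    return np.correlate(zero_insert(x), f, mode="valid")


def dense_path(x, f):
    """Same outputs from two smaller filters applied to the dense input."""
    # Even outputs: only the outer taps f[0] and f[2] land on real samples.
    even = np.correlate(x, f[[0, 2]], mode="valid")
    # Odd outputs: only the center tap f[1] lands on a real sample.
    odd = f[1] * x[1:]
    out = np.empty(len(even) + len(odd), dtype=even.dtype)
    out[0::2] = even   # interleave the two sub-filter outputs
    out[1::2] = odd
    return out


x = np.array([1, 2, 3, 4])
f = np.array([1, 10, 100])
print(sparse_path(x, f))  # [201  20 302  30 403  40]
print(dense_path(x, f))   # [201  20 302  30 403  40]
```

Both paths give the same output, but the dense path never multiplies by an inserted zero.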


8.1.2 WEIGHT SPARSITY 


Sparsity in the weights can come from several sources. Repeated weights in a filter naturally
occur if the number of weights in the filter exceeds the number of unique weights, which is
typically set by the number of bits per weight (e.g., a 3 × 3 × 128 filter has 1,152 weights and
will therefore have repeated weights if each weight is 8 bits, since 8 bits can only represent
256 unique values). Repeated weights
can also be enforced in the DNN model through the design of the filter or during training by
setting weights to zero (referred to as pruning) or by reducing the number of unique weights
(e.g., with reduced precision, as discussed in Chapter 7).⁵ In this section, we will discuss how
various design choices for the DNN model can be used to reveal additional weight sparsity,
including weight repetition for both zero and non-zero values. As previously discussed, unlike
activation sparsity, which in most cases is data dependent and needs to be computed online (e.g.,
generating the difference map during inference as discussed in Section 8.1.1), the approaches used
to increase weight sparsity can be done offline, and thus can be more computationally intensive
(e.g., can be incorporated into training).

Figure 8.8: Fast upsampling by exploiting structure in activation feature map. (Figure
from [242].)


Weight Reordering and Reuse 

Repeated weights can appear in a filter for a variety of reasons, as previously discussed.
Reordering the weights, and thus the activations and operations, can help reduce the number of
weight accesses from memory as well as the number of multiplications. For instance, the weight
can be multiplied with the sum of the activations rather than each activation, as demonstrated in
UCNN [247].

⁵ Enforcing repeated weights can also be thought of as a form of weight sharing, since the same weight is shared across
many locations in the filter.

We can illustrate this approach using a toy example. If the filter weights are
[a, b, a, c] and the input activations are [w, x, y, z], rather than computing the dot product as
aw + bx + ay + cz, which would require four weight reads, four multiplications, and three
additions, the weights and subsequently the activations can be reordered to be [a, a, b, c] and
[w, y, x, z], and the dot product can be computed as a(w + y) + bx + cz, which would require
three weight reads, three multiplications, and three additions. This approach is also referred to
as dot product factorization; it was inspired by the re-association used by the transforms (e.g.,
Winograd) described in Section 4.3, which were used to reduce the number of multiplications.
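
The toy example can be written as a short sketch that groups the activations by weight value before multiplying. The function name is hypothetical and the code illustrates only the arithmetic; it does not model the indirection tables or any hardware from [247].

```python
from collections import defaultdict


def factorized_dot(weights, activations):
    """Dot product that multiplies each unique weight only once.

    Returns (result, multiplications_performed).
    """
    # Sum the activations that share the same weight value.
    activation_sums = defaultdict(int)
    for w, a in zip(weights, activations):
        activation_sums[w] += a
    result = 0
    mults = 0
    for w, act_sum in activation_sums.items():
        result += w * act_sum   # one multiplication per unique weight
        mults += 1
    return result, mults


# Weights [a, b, a, c] with a=2, b=3, c=5 and activations [w, x, y, z].
print(factorized_dot([2, 3, 2, 5], [1, 10, 100, 1000]))  # (5232, 3)
```

The result matches the direct dot product (2 + 30 + 200 + 5000 = 5232) with three multiplications instead of four.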

Weight reordering does not impact accuracy. The main challenge with weight reordering is 
the hardware overhead due to: (1) the indirection tables required to fetch the activations and the 
weights in the desired order, which increase storage and memory accesses;⁶ and (2) the increase
in the number of bits in the adder and the input operand to the multiplier. 


Network Pruning 

To make network training easier, DNN models are usually over-parameterized. Therefore, many 
of the weights in a DNN model are believed to be redundant and can be removed (pruned, i.e., 
set to zero) without reducing accuracy. In fact, it has been shown that sparse DNN models (i.e., 
DNN models with high weight sparsity) tend to have higher accuracy than dense DNN models 
for a fixed number of effectual (i.e., non-zero) weights [248]. Network pruning refers to the 
process of removing weights in the DNN model. 

The concept of network pruning dates back to the late-1980s to mid-1990s [249-252]. 
The popularity of DNNs has renewed interest in this topic resulting in a significant amount of 
research over the past few years. Many of today’s approaches outperform random pruning when 
generating a sparse DNN model [248, 253-255]. 

Network pruning algorithms tend to follow the same process, which is illustrated in Fig- 
ure 8.9. They begin with a large pre-trained dense DNN model and undergo the following two 
stages: (1) weight removal to determine which weights to remove or set to zero; and (2) fine 
tuning to update the values of the weights (typically the remaining non-zero ones). These two 
stages are typically iteratively applied several times to gradually increase the weight sparsity in 
the DNN model. The process of assigning the number of weights to prune per iteration is re- 
ferred to as scheduling. We will now describe the various stages in detail and give examples of 
how they can vary across different network pruning algorithms. 
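The overall loop can be sketched as follows. This is a minimal sketch under stated assumptions: magnitude is used as the score, the schedule is passed in as an explicit list of counts, and `no_op_fine_tune` is a hypothetical placeholder for a real retraining step; none of these names come from a particular pruning library.

```python
import numpy as np

def remove_weights(weights, mask, num_to_prune):
    """Stage 1 (weight removal): drop the smallest-magnitude remaining weights."""
    remaining = np.flatnonzero(mask)                      # flat indices still unpruned
    order = np.argsort(np.abs(weights.ravel()[remaining]))
    new_mask = mask.copy()
    new_mask.flat[remaining[order[:num_to_prune]]] = False
    return new_mask

def prune(weights, fine_tune, schedule):
    """Alternate weight removal and fine tuning, following `schedule`."""
    mask = np.ones(weights.shape, dtype=bool)
    for num_to_prune in schedule:                         # scheduling: weights per iteration
        mask = remove_weights(weights, mask, num_to_prune)
        weights = fine_tune(weights * mask, mask)         # stage 2 (fine tuning)
    return weights * mask, mask

def no_op_fine_tune(weights, mask):
    # Placeholder (assumption): a real implementation would retrain the
    # unpruned weights here to recover accuracy.
    return weights

rng = np.random.default_rng(0)
dense = rng.normal(size=(4, 8))
sparse, mask = prune(dense, no_op_fine_tune, schedule=[8, 8, 8])
```

With 32 weights and three iterations of eight removals each, the resulting model keeps eight non-zero weights (75% weight sparsity).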

Weight removal can be broken down into three main steps, as shown in Figure 8.9: 


6In the previous example, the activation indirection table would store [0, 2, 1, 3] while the weight indirection table would
store [0, 0, 1, 2]. This overhead also exists for the weight reordering approach discussed in Section 8.1.1, which proposed early 
termination of the output activation calculation when using ReLU. 
Figure 8.9: The typical process of network pruning. Network pruning begins with a large
pre-trained dense DNN model. A two-stage iterative process is then applied involving weight
removal followed by fine tuning to remove and update the weights, respectively. The scheduling
process determines the number of weights that are pruned for each iteration. The output of
network pruning is a sparse DNN model.


• Scoring: Assigns a score to each weight or a group of weights based on a given criterion.
The most common criterion is the impact of the weight(s) on accuracy.

• Grouping: Weights can be grouped based on a pre-defined structure to allow groups of
weights to be removed rather than a single weight.

• Ranking: The weights are ranked based on their scores. Depending on grouping, each
weight is ranked individually or each group of weights is ranked relative to other
groups. The likelihood that each weight or group of weights is removed is based on its
rank.


There are various approaches for scoring the weights. The goal of many of these approaches 
is to use a score that can help minimize the impact on accuracy while maximizing weight sparsity 
and/or minimizing the number of MAC operations. For instance, early works such as Optimal 
Brain Damage [256] assigned scores to the weights based on the impact of each weight on the 
training loss (discussed in Section 1.2), referred to as weight saliency. However, weight saliency 
is expensive to compute for today’s large DNN models, since it requires the second derivative to 
be determined for each weight. 

Currently, the most popular approach to scoring the weights uses the magnitude of the
weights as the scores, which is often referred to as magnitude-based pruning [250, 257]. The
(c) Pruning based on the impact on the output feature map
(i.e., feature-map-based pruning)


Figure 8.10: Example showing how pruning based on the impact on the output feature map (i.e., 
feature-map-based pruning) can result in a higher accuracy (lower error) than magnitude-based 
pruning while achieving the same sparsity. Values in red indicate pruned weights. 


motivation behind this approach is that removing weights with small magnitudes will minimize 
changes to the filters and thus potentially minimize the impact on the feature maps and hence 
the accuracy; furthermore, evaluating the magnitude of the weights is easy to do. Several works 
have also explored the idea of scoring the weights based on their impact on the corresponding 
output feature maps to maximize the accuracy, called feature-map-based pruning [253].
Feature-map-based pruning can achieve higher accuracy than magnitude-based pruning at the
same sparsity since it minimizes the change to the feature maps directly rather than indirectly
through minimizing the change to the filters; for instance, weights with large magnitudes but
opposing signs could both be removed if their effects on the output feature map cancel each
other out, as shown in Figure 8.10. However, evaluating the impact of the weights on the
output feature map can be more costly than magnitude-based pruning, since it also requires
knowledge of the input data to the filters.
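A toy comparison of the two scoring approaches is sketched below. The weights, the sample inputs, and the exhaustive search over candidate masks are all assumptions made for illustration; this is not the procedure of [253], which would not enumerate masks for a realistically sized filter.

```python
import itertools
import numpy as np

weights = np.array([4.0, -3.9, 1.0])
# Sample inputs (one per row); the first two activations are always equal,
# so the two large weights of opposing sign nearly cancel in the output.
inputs = np.array([[1.0, 1.0, 2.0],
                   [2.0, 2.0, 1.0],
                   [3.0, 3.0, 4.0]])
reference = inputs @ weights                   # outputs of the dense filter

def output_error(mask):
    """Squared change in the output feature map caused by applying `mask`."""
    return float(np.sum((inputs @ (weights * mask) - reference) ** 2))

keep = 1                                       # both methods keep one weight

# Magnitude-based: keep the largest-magnitude weight.
mag_mask = np.zeros(3)
mag_mask[np.argsort(np.abs(weights))[-keep:]] = 1.0

# Feature-map-based: keep the weight(s) that change the output the least.
candidates = []
for kept in itertools.combinations(range(3), keep):
    m = np.zeros(3)
    m[list(kept)] = 1.0
    candidates.append(m)
fmap_mask = min(candidates, key=output_error)
```

At the same sparsity, the feature-map-based choice keeps the small third weight and removes both large ones, giving a much smaller output error than the magnitude-based choice. Note that computing `output_error` requires the input data, which is the extra cost mentioned above.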
Figure 8.11: Energy estimation methodology [259] used for energy-aware pruning in [253]. It
estimates the energy based on data movement from different levels of the memory hierarchy, 
number of MACs, and data sparsity. This tool is available at https://energyestimation.mit.edu/. 


In addition to considering the impact of each weight on the accuracy when scoring the
weights, it is also important to consider the impact that each weight has on other metrics
(Chapter 3) such as energy efficiency and throughput. In other words, we want to remove the
weights that achieve not only the smallest decrease in accuracy but also the largest increase in
energy efficiency or throughput after being pruned. For instance, energy-aware pruning considers
both the energy and the accuracy while scoring the weights, which results in a 1.7x increase in
energy efficiency compared to magnitude-based pruning for the same accuracy [253]. Alternatively, the
impact on latency can also be considered as proposed in NetAdapt [258]. These approaches are 
often referred to as hardware-aware or hardware in the loop as they incorporate hardware met- 
rics such as energy and latency into the DNN model design process; in other words, they are 
mechanisms for hardware and DNN model co-design. These hardware metrics can be obtained 
using throughput and energy estimation tools [187, 259, 260] or using empirical measurements 
directly obtained from the hardware itself [258]. Figure 8.11 shows an example of the energy 
estimation methodology [259] used for energy-aware pruning in [253]. 
Figure 8.12: Pruning across various degrees of granularity.


There are also various approaches for grouping the weights. The weights can be pruned
individually⁷ (referred to as fine-grained pruning) or in groups with a pre-defined structure
(referred to as coarse-grained pruning). For coarse-grained pruning, the weights in the same group
can be restricted to specific locations with different degrees of granularity, such as the same
column, row, channel, or filter [258, 261-264]. Figure 8.12 illustrates the differences between these
different forms of grouping.⁸ It should be noted that pruning channels or filters can be thought
of as changing the layer shape of the DNN network architecture (e.g., changing C or M of the 
DNN model dimensions in Figure 2.2b); therefore, tools that perform automatic channel or fil- 
ter pruning, e.g., NetAdapt [258], can also be viewed as a form of network/neural architecture 
search (NAS), which is discussed in Chapter 9. 
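The difference between the two extremes of granularity can be sketched with NumPy. The filter shape, the random weights, and the use of an L2 norm as the per-filter score are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, 3, 3))        # assumed layout: M, C, R, S

def fine_grained_mask(w, sparsity):
    """Prune individual weights with the smallest magnitudes."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.abs(w) >= threshold

def filter_mask(w, num_filters_to_prune):
    """Prune whole filters (groups along M) with the smallest L2 norms."""
    norms = np.sqrt((w ** 2).sum(axis=(1, 2, 3)))
    pruned = np.argsort(norms)[:num_filters_to_prune]
    mask = np.ones(w.shape, dtype=bool)
    mask[pruned] = False                       # zero out every weight in the filter
    return mask

fine = fine_grained_mask(filters, 0.5)
coarse = filter_mask(filters, 2)
```

Both masks give 50% weight sparsity here, but `coarse` removes two complete filters (effectively reducing M from 4 to 2), whereas the zeros in `fine` can fall anywhere.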

There are several things to consider when selecting the granularity. On one hand,
fine-grained pruning often results in higher degrees of sparsity than coarse-grained pruning
(for a given accuracy) since fine-grained pruning is less constrained. On the other hand, it
is more costly for hardware to exploit fine-grained pruning than coarse-grained pruning since it
must check for non-zero weights more frequently and at any location. In addition, there is more
overhead for signaling the location of non-zero weights, which reduces the benefits of using
compression; this overhead is discussed in Section 8.2.


7This can be viewed as having only one weight in a group.

8In the literature, coarse-grained and structured pruning are often used interchangeably when referring to pruning channels
or filters, while fine-grained and unstructured pruning are often used interchangeably when referring to pruning individual
weights.

Due to the challenges of handling DNN models with fine-grained sparsity, they are typ- 
ically best suited for processing on custom sparse DNN accelerators, which are discussed in 
Section 8.3. While general-purpose platforms such as CPUs and GPUs do have libraries that
handle fine-grained sparse data (e.g., Sparse BLAS, cuSPARSE), they often require extremely
high sparsity (e.g., >90%) to achieve throughput benefits. Therefore, DNN models with
coarse-grained sparsity are typically preferred for platforms such as CPUs and GPUs. The granularity
of pruning can also be customized to the hardware platform to increase the impact on
throughput for the same amount of weight sparsity. For instance, Scalpel [265] matches the pruning
granularity to the underlying hardware parallelism, specifically, the SIMD width, and achieves 
a 2 to 3x speed up compared to fine-grained pruning. This is yet another example of bringing 
hardware in the loop of DNN model design. 

Weight reordering can be used on top of pruning to increase the coarseness of the sparsity, 
as shown in Figure 8.13b [266, 267]. Specifically, non-zero weights can be grouped together to 
form a dense weight matrix, which can then be processed using hardware that efficiently supports 
dense matrix multiplications, including general purpose platforms such as CPUs and GPUs, as 
well as dense DNN accelerators, such as the processing-in-memory based DNN accelerators 
discussed in Chapter 10. 
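A small sketch of this idea is shown below. The matrix values are assumed, and the reordering heuristic (move every row and column that contains a non-zero to the front) is a simplification that only works when the non-zeros already share rows and columns; it is not the algorithm of [266, 267].

```python
import numpy as np

# Pruned weight matrix (outputs x inputs): non-zeros sit in rows {0, 2}, columns {1, 3}.
W = np.zeros((4, 4))
W[0, 1], W[0, 3], W[2, 1], W[2, 3] = 1.0, 2.0, 3.0, 4.0
x = np.array([10.0, 20.0, 30.0, 40.0])         # input activations

# Reorder so that rows/columns containing non-zeros come first.
row_order = np.argsort(~W.any(axis=1), kind="stable")
col_order = np.argsort(~W.any(axis=0), kind="stable")
num_rows = int(W.any(axis=1).sum())
num_cols = int(W.any(axis=0).sum())

W_reordered = W[np.ix_(row_order, col_order)]
dense_block = W_reordered[:num_rows, :num_cols]   # 2x2 block with no zeros

# Dense matrix-vector product on reordered inputs; outputs are scattered back.
y = np.zeros(4)
y[row_order[:num_rows]] = dense_block @ x[col_order[:num_cols]]
```

The result `y` equals `W @ x`, but only the dense 2x2 block is multiplied. As in Figure 8.13, the input and output activations must be reordered consistently with the weights.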

There are also various approaches for ranking the weights. The weights can be ranked at a 
local scale (e.g., per layer) or at a global scale (e.g., across the entire DNN model). Global rank- 
ing of all the weights within the DNN model often provides better performance (e.g., higher 
sparsity for a given accuracy) than local ranking. However, global ranking is often more com- 
putationally complex than local ranking because global ranking involves more comparisons and 
requires the use of scoring methods that can fairly compare weights in different layers, such 
as weight saliency [256], which are often computationally expensive. In contrast, local ranking 
typically only ranks the weights within each of the layers, which requires fewer comparisons 
and allows the use of simple scoring methods, such as magnitude-based pruning [250, 257] or 
feature-map-based pruning [253], which makes it a popular choice. One of the main challenges
of local ranking is that per-layer sparsity needs to be specified and the optimal specification
is usually difficult to determine.
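The contrast between local and global ranking can be illustrated with two tiny layers. The layer names and weight values are assumptions, and raw magnitude is used as the score in both cases purely for simplicity; as noted above, a score that compares fairly across layers would normally be needed for global ranking.

```python
import numpy as np

layers = {
    "conv1": np.array([0.9, -0.8, 0.7, 0.1]),
    "fc1": np.array([0.05, -0.04, 0.5, 0.03]),
}

def local_prune(layers, per_layer_sparsity):
    """Rank within each layer; the sparsity of every layer must be specified."""
    masks = {}
    for name, w in layers.items():
        k = int(per_layer_sparsity[name] * w.size)
        keep = np.ones(w.size, dtype=bool)
        keep[np.argsort(np.abs(w))[:k]] = False
        masks[name] = keep
    return masks

def global_prune(layers, sparsity):
    """Rank all weights together; per-layer sparsity falls out of the ranking."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    k = int(sparsity * all_mags.size)
    threshold = np.sort(all_mags)[k]
    return {name: np.abs(w) >= threshold for name, w in layers.items()}

local_masks = local_prune(layers, {"conv1": 0.5, "fc1": 0.5})
global_masks = global_prune(layers, 0.5)
```

Both calls remove half of the weights overall, but global ranking keeps three weights in `conv1` and one in `fc1`, while local ranking is forced to keep two in each layer.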

Fine tuning updates the values of the remaining weights to restore accuracy. There are 
several important design decisions that are considered in the various approaches used for fine 
tuning. The first is how to initialize the weights at the beginning of fine tuning. For instance, the
weights can be reinitialized each iteration, they can start from the state at the end of the previous 
iteration (most popular), or they can rewind to an earlier state [268]. Another important decision 
(a) Pruning without weight reordering
(b) Pruning with weight reordering


Figure 8.13: Applying weight reordering on top of pruning to increase the coarseness of the 
sparsity. In this example: (a) before weight reordering, the granularity of the sparsity is 1x2; and 
(b) after weight reordering, the granularity of the sparsity is 3x4. Note that the input activations 
and output activations are also reordered. 


is whether the weights are updated by performing a global or local optimization. Performing a 
global optimization is similar to typical network training approaches, where all weights across 
all layers are jointly optimized and updated simultaneously. Performing a local optimization,
where only a subset of weights are jointly optimized and updated simultaneously, can reduce
the time required for fine tuning. For instance, fine tuning can be performed per layer, where
the weights are updated by minimizing the difference between the output feature maps before 
and after pruning [253]. This minimization has a closed-form solution, so local optimization 
can be performed faster than global optimization. Finally, since fine tuning updates the weights, 
previously pruned weights might become important at a later iteration. Accordingly, some recent 
works explore the idea of restoring some of the pruned weights during fine tuning [269-271]. 
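Per-layer fine tuning with a closed-form solution can be sketched as an ordinary least-squares problem. This is a generic sketch for a fully connected layer with random data and a random mask, all of which are assumptions; it is not necessarily the exact formulation used in [253].

```python
import numpy as np

def local_fine_tune(X, W_dense, mask):
    """Refit the unpruned weights of one layer so that X @ W_new ~= X @ W_dense."""
    Y = X @ W_dense                                # output feature map before pruning
    W_new = np.zeros_like(W_dense)
    for j in range(W_dense.shape[1]):              # each output is solved independently
        kept = np.flatnonzero(mask[:, j])
        if kept.size:
            solution, *_ = np.linalg.lstsq(X[:, kept], Y[:, j], rcond=None)
            W_new[kept, j] = solution
    return W_new

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                       # 50 sample inputs, 6 input features
W_dense = rng.normal(size=(6, 3))
mask = rng.random((6, 3)) > 0.5                    # True marks a weight that is kept

W_tuned = local_fine_tune(X, W_dense, mask)
err_masked = np.sum((X @ (W_dense * mask) - X @ W_dense) ** 2)
err_tuned = np.sum((X @ W_tuned - X @ W_dense) ** 2)
```

Because the least-squares fit is optimal over the kept weights, `err_tuned` cannot exceed `err_masked`, and no gradient-based training loop is needed.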

Scheduling involves determining how many weights to prune in each iteration in order
to achieve a target weight sparsity in the final sparse DNN model. We can set the number of
weights per iteration and desired weight sparsity explicitly [257] or infer them from energy [253]
or latency [258]. The common scheduling methods include: (1) pruning all the weights required
to achieve the target number of weights in a single iteration (one shot) [272]; (2) pruning a fixed
fraction of weights at each iteration across multiple iterations [257]; or (3) pruning a different
fraction of weights at each iteration across multiple iterations [254]. Scheduling becomes more
challenging when ranking the weights locally, since the number of weights to prune per iteration 
needs to be further specified at a finer granularity (e.g., on a per layer basis rather than for 
the entire network). One approach to tackling this challenge is to apply pruning to only one
layer per iteration, specifically the layer that achieves the best accuracy versus latency (or energy
consumption) trade-off when pruned [258].
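The three scheduling methods can be sketched as functions that return the number of weights to prune in each iteration. Two points are assumptions: the fixed fraction is taken relative to the weights that remain at each iteration, and the varying schedule uses a cubic ramp chosen only for illustration.

```python
def one_shot(num_weights, target_sparsity):
    """(1) Prune everything needed in a single iteration."""
    return [int(round(target_sparsity * num_weights))]

def fixed_fraction(num_weights, fraction, num_iterations):
    """(2) Prune the same fraction of the remaining weights every iteration."""
    remaining, counts = num_weights, []
    for _ in range(num_iterations):
        n = int(fraction * remaining)
        counts.append(n)
        remaining -= n
    return counts

def varying_fraction(num_weights, target_sparsity, num_iterations):
    """(3) Prune aggressively at first and more gently later (cubic ramp)."""
    counts, pruned = [], 0
    for t in range(1, num_iterations + 1):
        sparsity = target_sparsity * (1 - (1 - t / num_iterations) ** 3)
        target = int(round(sparsity * num_weights))
        counts.append(target - pruned)
        pruned = target
    return counts

fixed = fixed_fraction(1000, 0.2, 3)
ramp = varying_fraction(1000, 0.9, 3)
```

Either list could be passed as the per-iteration schedule of an iterative pruning loop; with local ranking, one such list would be needed per layer.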

It should be noted that there is an interplay between network pruning and the shape and 
size of the DNN model (i.e., the network architecture of the DNN model). While pruning can 
improve the efficiency of a given DNN model, switching to a DNN model with a more efficient
network architecture, as described in Chapter 9, can often result in better efficiency, as shown
in Figure 8.14, where efficiency is defined as number of weights or MAC operations versus 
accuracy [248]; this has yet to be explored for energy efficiency or throughput versus accuracy. 
Furthermore, network pruning is more effective on inefficient DNN models than DNN models 
that already have an efficient network architecture [248]. For instance, pruning on AlexNet 
can reduce the number of weights by 10.6x, the number of MACs by 6.6x, and the energy 
consumption by 3.7x, whereas pruning on GoogLeNet only reduces the number of weights 
by 2.9x, the number of MACs by 3.4x, and the energy consumption by 1.6x [253]. Finally, 
pruning is often more effective in reducing the number of weights on fully connected layers 
than convolutional layers. For instance, several of the fully connected layers in AlexNet can 
be pruned to over 90% weight sparsity, while the convolutional layers only reach around 60% 
weight sparsity [253]; however, this could also be because AlexNet has three fully connected 
layers, whereas more modern DNN models (e.g., GoogLeNet, ResNet, MobileNet) only use 
one fully connected layer, and thus the three fully connected layers of AlexNet are extremely 
over-parameterized. 

Network pruning continues to be a very active field in both research and industry. Un- 
fortunately, the field currently lacks standardized benchmarks and metrics to properly compare 
and evaluate the numerous methods that have been developed, making it difficult to measure 
the progress in the field. To address this, Blalock et al. [248] identify issues with current
practices based on a survey of over 80 papers (e.g., lack of controlled comparisons, imprecise and
incomplete specification of experimental setup and metrics), provide various solutions to those
issues (e.g., report results as multiple points on a trade-off curve, with multiple dataset and DNN
model combinations), and propose the use of an open-source framework called ShrinkBench,
which provides a standardized collection of pruning primitives, models, datasets, and training
routines to enable standardized evaluation of pruning methods.
Figure 8.14: Pruning versus network architecture of DNN model. Using a DNN model with a
more efficient network architecture (e.g., EfficientNet or MobileNet) can result in a better
trade-off between accuracy and number of weights, and between accuracy and number of MAC
operations. (Figure adapted from [248].)


Dilated Convolutions 

Increasing the receptive field of a filter (i.e., the spatial dimensions of the filter R and S of the 
DNN model dimensions in Figure 2.2b) can sometimes help increase accuracy, since each output 
activation is a combination of a larger region of inputs (e.g., in an image this could mean seeing 
the entire face versus just the nose, or seeing more of the environment around an object, which 
could provide more context to help identify the object). However, increasing the receptive field 
Figure 8.15: Filter for dilated convolution at various dilation rates (i.e., number of zeros inserted
between non-zero weights). Panels show dilation rates of 1, 2, and 4.


increases the number of weights in the DNN model and the number of MAC operations that 
need to be performed per output activation. 

Various approaches can be applied to achieve the same large receptive field while reducing 
the additional cost. One approach is stacking multiple filters with smaller receptive fields; this 
was explored in DNN models such as VGGNet [73], and is discussed in more detail in Chap- 
ter 9. Another approach is spatial pooling (described in Section 2.3.4), which is often used for 
applications such as image classification, where the output of the DNN model is reduced down 
to one label. 

For DNN models with dense outputs, another way to increase the receptive field without 
significant complexity overhead is to insert zeros between the weights of the filter, as shown 
in Figure 8.15; this can be thought of as upsampling the filter, similar to how we upsampled the
input feature map, as discussed in Section 8.1.1. This approach is commonly referred to as dilated
convolution [273] or atrous convolution [240] and is another source of weight sparsity. As
mentioned in Section 8.1.1, the sparsity from upsampling is structured and known a priori and 
thus the cost of detecting sparsity can be significantly reduced compared to the unstructured 
sparsity described in the previous section. 
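Constructing a dilated filter by inserting zeros can be sketched as below. The sketch assumes the convention used by common deep learning frameworks, in which a rate of r places r - 1 zeros between neighboring weights so that a rate of 1 leaves the filter unchanged; the function name and the example filter are hypothetical.

```python
import numpy as np

def dilate_filter(filt, rate):
    """Upsample a 2-D filter by placing (rate - 1) zeros between its weights."""
    R, S = filt.shape
    out = np.zeros(((R - 1) * rate + 1, (S - 1) * rate + 1), dtype=filt.dtype)
    out[::rate, ::rate] = filt        # original weights land on a regular grid
    return out

filt = np.arange(1, 10).reshape(3, 3)     # 3x3 filter with nine non-zero weights
dilated_2 = dilate_filter(filt, 2)        # spans 5x5 inputs
dilated_4 = dilate_filter(filt, 4)        # spans 9x9 inputs
```

The receptive field grows from 3x3 to 5x5 and 9x9, while the number of non-zero weights (and hence effectual MAC operations per output) stays at nine. Because the zero locations follow a fixed grid, hardware can skip them without any run-time detection.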


8.2 COMPRESSION 


The existence of sparsity in both weights and activations, as described in the previous
sections, inspires the application of techniques to compress the data to reduce storage space, data
movement, and/or computation to save time and/or energy in DNN accelerators. To explore this
opportunity, recall that in Chapter 2 it was noted that the multi-dimensional operands used in
DNN computations can be viewed as tensors. For instance, input activations can be represented
as 4-D tensors with dimensions N, H, W, and C and filter weights can be represented as 4-D
tensors with dimensions R, S, C, and M. Since the data in these tensors is often sparse, we
will discuss compression in the context of compressing tensors.

A large number of formats have been proposed for representing sparse tensors, as
nicely summarized by the TACO project in [274]. From a hardware design perspective, each of
these formats has characteristics that distinguish it along several axes. For example,
they can vary in size in memory, the cost of accessing the elements of the tensor, and the cost of
operators that modify the tensor (e.g., add an element) or combine it with other tensors (e.g.,
intersection). These metrics can also vary with the sparsity of the tensor or other statistical
characteristics (e.g., clustering along a diagonal). In conjunction with a design that uses a specific
format, these characteristics will be reflected in many of the operational metrics described in
Chapter 3.

Rather than just enumerate compression techniques, however, we will present an abstrac- 
tion for the representation of tensors in memory and the operations that can be performed on 
them. Within this abstraction the various attributes of different compression techniques can be 
classified and characterized, and the opportunities for mixing and matching techniques should 
be evident. 

Providing a common abstraction for the operations that can be performed on a tensor 
has a further benefit in understanding the hardware (e.g., its dataflow) used to process sparse 
tensors. We find that often design descriptions get bogged down in the detailed manipulations 
required by a specific data representation. These details can obscure the high-level principles 
behind the design. Having a common abstraction for any sparse tensor representation and a 
common set of operations on that abstraction allows for a separation between the details of the 
exact manipulations required to operate on a specific representation of the tensor in memory and 
the essence of the dataflow. Hopefully, this will provide an opportunity to more clearly compare 
and contrast different designs and gain insights into the tradeoffs between them. 

Note, however, that although there is a conceptual partitioning between the algorithmic 
activity (e.g., dataflow) on the tensors and their representation, design decisions will need to 
take both into account. A design that relies heavily on a particular operation probably should 
not be paired with a representation for which that operation is especially expensive. Instead, 
this should be viewed as an opportunity for a co-design process, where the design’s algorithmic 
activity and the representation for the tensors used in the design are jointly selected for overall 
optimal metrics. Furthermore, a specific design may employ cross-layer optimization breaking the 
strict boundary between the implementation of the algorithm and the implementation of the 
data representation. 


8.2.1 TENSOR TERMINOLOGY 


Figure 8.16 shows a matrix (i.e., a 2-D or rank-2 tensor) with the terminology we are going to 
use to describe tensors. Specifically, the axes of the matrix are referred to as ranks and labeled with 
a rank-id. In this case, the matrix has two rank-ids (H and W), corresponding to the ranks of a 
Figure 8.16: A rank-2 tensor (i.e., a matrix) showing the key terms for describing a tensor: rank,
coordinate, and point. Assuming the ranks are ordered by their ids as H, W, the value at the point
(1, 2) is “f”.
Figure 8.17: A tensor represented as a tree. Each level of the tree corresponds to a rank of the
tensor, and contains a node for each coordinate in that rank. The leaves of the tree are the values 
at each point in the tensor. 


channel of input activations. The individual elements of a rank are also labeled with a coordinate 
and thus the value at an individual cell (or point) in the matrix can be identified by a tuple of 
coordinates - one for each rank. Thus, the value “f” at coordinate 1 in rank H and coordinate 2 
in rank W is at point (h, w) = (1, 2). 
To discuss the scope of compression opportunities, we are going to explore an expanded
version of the tree-based tensor representation described in [275] and used in [276]. Figure 8.17
shows an abstraction for a tensor as a tree with a root node (black diamond), intermediate
nodes (brown circles), leaf nodes (blue squares), and edges (black lines). Below the root node
(black diamond “R”) each level (dashed oval) corresponds to a rank (or dimension) of the
tensor.⁹ The first rank is the dashed oval “H”. Each intermediate node of the tree corresponds to
a coordinate in some rank of the tensor. For intermediate ranks, a node’s children are the
coordinates of the next lower rank (e.g., dashed oval “W” of the tensor). At the lowest rank, the
child of a coordinate (i.e., a leaf node) is the value of a specific point of the tensor (blue square).
Note that one can determine the point a value is at from the sequence of coordinates passed at each
rank while traveling down the tree from the root to the value. Thus, the value “c” is at point
(h, w) = (0, 2).

An important and common operation on a tensor is finding the value at a particular point 
(i.e., ordered set of coordinates) in the tensor. In this tree representation, it should be clear that 
finding a point in the tensor involves traversing the tree looking for the desired coordinate at each 
rank in turn. Although abstractly the order of the ranks has no meaning, in most concrete tensor
representations the order of the ranks is of great importance, because the order can significantly
affect the cost of accesses and therefore is often a salient characteristic of any algorithm that
operates on the tensor. Thus, even in our abstract representation of a tensor the order is manifest
in the representation. We discuss this in more detail later, but first we are going to consider 
sparsity. 

Representing a sparse tensor in this tree is straightforward, as illustrated in Figure 8.18. In
the figure, sparsity is manifest simply as a coordinate node having fewer children
(i.e., just the coordinates of non-empty elements in the next rank). Furthermore, if we ignore
the details of the implementation of the nodes and edges of the tree, this representation abstracts 
most of the details of the way a tensor is represented in storage, giving the accelerator designers 
the opportunity to optimize the representation for the storage and operational characteristics 
desired in their design. 

Although viewing a tensor as a tree where each node has edges emanating from it con- 
necting to each of its multiple children is not an unreasonable model, to improve the relationship 
between our model and actual design considerations we are going to use an alternative model. 
Specifically, we are going to assume that each coordinate node has a single payload that is either
(1) a collection of coordinates each with their own payload (at intermediate levels in the tree) or 
(2) a single value at the lowest level (i.e., rank) of the tree. 

For case 1, the collection of coordinates that are children of a single coordinate will be
called a fiber, as shown in Figure 8.19.¹⁰ A rank will therefore contain one fiber for each
coordinate in the rank above, and each fiber will be the payload of one of those coordinates.


9Later we will discuss optimizations that relax this assumption.
10This roughly corresponds to the mathematical notion of a fiber where all coordinates of a tensor are fixed except one. In 
our case, that would map to a tensor consisting of the ranks up to and including the rank of the fiber of interest. 
(a) Sparse tensor-matrix. Blank cells represent an empty point in the tensor.

(b) Sparse tensor-tree. All coordinates with no payload are dropped from the tree. Thus,
coordinates in rank W corresponding to empty cells are dropped and, if all children
of a coordinate in rank H have been dropped, that coordinate is dropped as well.


Figure 8.18: Sparse tensor as matrix and tree.


Figure 8.19 illustrates this fiber-tree representation. The figure is interpreted as follows: the
root of the tree (black diamond marked “R”) points at the single fiber (solid oval) in the top (or 
highest) rank (dotted oval “H”) of the tensor. Rank H’s fiber holds the non-empty coordinates 
at that rank. Each coordinate in rank H’s fiber has a payload that is a reference (black line) 
to a fiber (solid oval) at the next lower rank (dotted oval “W”). In this case, rank W contains 
two fibers. For a higher dimension tensor, the interpretation would continue recursively in that 
Figure 8.19: Sparse Fiber Tree. Each level of the tree corresponds to a rank that contains
one or more fibers. Each fiber contains a set of coordinates, whose payload is either another fiber
at the next lower rank or a value at the bottom of the tree. Note, fibers at the lower rank only
have coordinates for non-empty values, and empty fibers are dropped from all levels of the tree.


pattern until the lowest rank (rank W in this case), where the payload of each coordinate is 
a value (blue square) marked with letters that represent scalars or other terminal data values. 
Using this abstraction we can consider a variety of concrete representations of a tensor and their 
efficiency in space and time. 
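One way to make the fiber-tree abstraction concrete is the small Python sketch below. The `Fiber` class and `from_dense` helper are hypothetical names, and the example matrix is an assumption chosen to be consistent with the text (the value “c” at point (0, 2), and coordinate 2 of rank H owning a fiber with coordinates 0 and 1).

```python
from dataclasses import dataclass, field

@dataclass
class Fiber:
    """A fiber: parallel lists of coordinates and their payloads."""
    coords: list = field(default_factory=list)
    payloads: list = field(default_factory=list)

def from_dense(matrix, empty=None):
    """Build a two-rank (H, W) fiber tree, dropping empty points and empty fibers."""
    top = Fiber()                                  # the single fiber in rank H
    for h, row in enumerate(matrix):
        fiber = Fiber()                            # candidate fiber in rank W
        for w, value in enumerate(row):
            if value is not empty:
                fiber.coords.append(w)
                fiber.payloads.append(value)       # lowest rank: payload is a value
        if fiber.coords:                           # empty fibers are dropped
            top.coords.append(h)
            top.payloads.append(fiber)             # upper rank: payload is a fiber
    return top

matrix = [["a", None, "c"],
          [None, None, None],
          ["g", "h", None]]
tree = from_dense(matrix)
```

Rank H then holds a single fiber with coordinates 0 and 2, and rank W holds two fibers, one with coordinates 0 and 2 and one with coordinates 0 and 1. How `coords` and `payloads` are laid out in storage is deliberately left open, which is the point of the abstraction.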

In summary, the terminology we will use is as follows: 


• Rank - an axis, such as a row, column, or higher dimension, of a tensor. In our
representation, ranks are labeled with a name, called a rank-id (e.g., H and W in Figure 8.19).
Furthermore, in our abstract tensor representation, ranks are ordered from top (highest
rank) to bottom (lowest rank).

• Coordinate - an identifier associated with each element (or item) contained in a rank of a
tensor. For example, the numbers in the brown circles in Figure 8.19.

• Payload - the value associated with a particular coordinate in a rank. In Figure 8.19,
references to a payload are indicated by the black lines (edges) in the tree. For intermediate
ranks, the payload of a coordinate will be a fiber that corresponds to a sub-tensor (of the
lower ranks). For example, in Figure 8.19, the coordinate 2 in rank H has a fiber in rank
W as its payload, so coordinate 2’s payload corresponds to a 1-D sub-tensor whose top fiber
has coordinates 0 and 1. The payloads of the coordinates in the lowest rank will be a simple
value (e.g., a number).

• Point - a set of coordinates (one for each rank of the tensor) that localizes a single value
in the tensor. The order of the coordinates matches the order of the ranks (from top to
bottom in the fiber tree). For example, the value “c” (blue square) in Figure 8.19 is at point
(h, w) = (0, 2).

• Fiber - a set of coordinates and their associated payloads. All of the coordinates in a fiber
are children of a single coordinate in the rank above. Therefore, a rank will contain one or
more fibers and each fiber in a rank will be the payload associated with a coordinate of the
rank above.


8.2.2 CLASSIFICATION OF TENSOR REPRESENTATIONS 


In general, there are a variety of characteristics of the concrete representation (or format) in 
memory of a tensor that will affect its desirability for use in a particular DNN accelerator design. 
Since the fundamental component of our tensor abstraction is a fiber, our focus will initially be on 
concrete representations of fibers, and on the tradeoffs among representations based on factors 
such as how much space they occupy and the complexity of the operations that access or operate 
on them. 

In practice, a key operation on a fiber is to lookup a payload associated with a specific
coordinate. That will generally mean finding the address (in physical storage) of the payload 
associated with the given coordinate. Therefore, finding the value at a point in the tensor will 
nominally require a series of such lookups traveling down the tree. That series of lookups will 
locate (e.g., by address) the fibers at the intermediate ranks and ultimately the final payload, 
which will contain the value of the desired point. 
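The series of lookups can be sketched as follows. For brevity, each fiber is modeled here as a list of (coordinate, payload) tuples and each lookup is a linear search; both choices are assumptions, and a concrete representation may locate payloads quite differently.

```python
def lookup(fiber, coord):
    """Return the payload stored for `coord` in one fiber, or None if absent."""
    for c, payload in fiber:
        if c == coord:
            return payload
    return None

def value_at(top_fiber, point):
    """Walk down the tree, doing one lookup per rank, to reach the value at `point`."""
    payload = top_fiber
    for coord in point:
        payload = lookup(payload, coord)
        if payload is None:            # empty point: the coordinate was dropped
            return None
    return payload

# Assumed two-rank (H, W) fiber tree: "c" at (0, 2); H coordinate 2 owns coords 0 and 1.
tree = [(0, [(0, "a"), (2, "c")]),
        (2, [(0, "g"), (1, "h")])]

found = value_at(tree, (0, 2))
missing = value_at(tree, (1, 2))
```

Finding “c” takes one lookup in the rank H fiber and one in the rank W fiber, while the lookup at point (1, 2) stops at rank H because coordinate 1 has no payload.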

Obviously, the efficiency of a specific fiber representation, as characterized by metrics 
such as size and speed of payload lookup, will depend on the in-storage layout of the fiber 
and the algorithmic choices available to traverse that layout. Hardware DNN implementations 
need to consider the implementation costs and tradeoffs among these metrics in order to choose 
the representation that best optimizes the design. From this perspective, a wide variety of fiber 
representations have been proposed for use in software and hardware, and can be classified in a 
variety of ways. 


Explicit versus Implicit Coordinate Representations 
Fiber representations can be classified by whether the values of the coordinates are manifest 
explicitly in the representation or not. These will be referred to as explicit or implicit coordinate 
representations, respectively. 

An example of an implicit coordinate representation is an implementation of a fiber as a 
standard 1-D array data structure, as shown in Figure 8.20a. 

Figure 8.20: Uncompressed and compressed fiber representations. Subfigure (a) shows an array- 
style representation, where the payloads (blue rectangles) and zero values (blank rectangles) oc- 
cupy all the elements of the array. Below each element is a number indicating its position in 
the array and the same number corresponds to the element’s (implicit) coordinate. Subfigure (b) 
shows a coordinate/payload representation where each position holds a tuple containing each 
non-zero element’s coordinate (brown rectangle) and value (blue rectangle). 


For fibers represented in an array-style representation, the coordinates never appear directly 
in memory, but are determined directly as an offset to a specific payload in the list of payloads. In 
general, we will refer to the offset (or relative location in memory) of each element in a fiber as the 
element’s position. Payload lookup by coordinate generally needs to find the position of a payload 
with the given coordinate. In an array-style representation, payload lookup by coordinate is 
very efficient, since the position can be determined directly from the coordinate. Unfortunately, 
for sparse fibers the space required to hold the fiber could be quite large because even empty 
elements of the fiber occupy space. 

An explicit coordinate representation, where the coordinate values actually appear in mem- 
ory, can be useful to save space for very sparse fibers. For example, a list of coordinate/payload 
tuples for non-empty coordinates would occupy much less space than an uncompressed array 
of payloads. An example of a fiber represented as such a coordinate/payload list is shown in Fig- 
ure 8.20b. 

Figure 8.21: Examples of implicit coordinate-style methods to compress a fiber. Blue boxes 
are non-zero values, white boxes are zero values, and grey boxes are encoding metadata. In 
subfigure (b), the metadata contains the number of zero values between two non-zero values. 
In subfigure (c), the metadata is a bitmask where each one in the bitmask indicates that the 
corresponding coordinate is non-zero and counting ones to its left indicates the value’s position. 

When using a coordinate/payload list representation, one can see that if a coordinate in a 
high-level fiber is empty (i.e., indicative that an entire tile of the tensor is zero) a considerable 
amount of space at all the lower levels is saved. Such an explicit coordinate representation, however, 
would have more costly payload lookups because although each coordinate has a position 
in the fiber there is not a direct mapping from coordinate to position. Thus, a more complex 
lookup (e.g., via a binary search) would be required. This creates a tradeoff between space (large 
for the array-style representation) and speed/energy (more memory accesses for the binary search 
needed by lookup in a coordinate/payload representation). However, sometimes the space savings 
can be worthwhile, and such a coordinate/payload list representation is used as a component 
of a variety of well-known tensor representations, such as compressed sparse row (CSR) [277], 
compressed sparse column (CSC) [278], and compressed sparse fiber (CSF) [275]. Additional 
characterizations of these tensor formats will be presented in Section 8.2.5. 
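
To make the space versus lookup tradeoff concrete, the following sketch stores the same fiber both ways. The fiber contents and the function names `lookup_array` and `lookup_list` are hypothetical, and the binary search uses Python's standard `bisect` module as a stand-in for whatever search logic a real design would use.

```python
from bisect import bisect_left

# The same sparse fiber (8 coordinates) in two hypothetical representations.
dense = [0, 0, 5, 0, 0, 0, 7, 9]   # implicit coordinates: position == coordinate
coords = [2, 6, 7]                 # explicit coordinates, kept sorted
payloads = [5, 7, 9]               # payloads[p] belongs to coords[p]

def lookup_array(coord):
    return dense[coord]            # one direct access, but 8 stored elements

def lookup_list(coord):
    pos = bisect_left(coords, coord)          # binary search for the position
    if pos < len(coords) and coords[pos] == coord:
        return payloads[pos]
    return 0                                  # the coordinate is empty
```

The array stores every element and finds a payload in one access; the list stores only three coordinate/payload pairs but needs a search whose cost grows with the number of non-empty elements.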


Compressed versus Uncompressed Representations 
Fiber representations can also be classified by whether the size of the fiber in memory is a func- 
tion of its sparsity. Or, in other words, is the fiber compressed or not. 

Clearly, the (implicit coordinate) 1-D array-style representation of a fiber described above 
is an uncompressed representation because the space occupied is not a function of sparsity. How- 
ever, one can enhance it to create a compressed representation by adding metadata that records 
where there are empty values. This includes any of the myriad run-length encoding (RLE) 
schemes,¹¹ where payloads are interspersed with information about the number of empty values 
between non-empty values (see Figure 8.21b). Empty values can be identified in other ways, 
such as with a bit-mask (see Figure 8.21c). These compressed representations can reduce storage 
and data movement costs as was done in the Eyeriss design [101]. In Eyeriss, the volume of 
partial sum data transmitted to/from DRAM and the size of the data stored there was reduced 
using an RLE compression scheme. 
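
As an illustration of the bitmask variant, the sketch below finds a payload's position by counting the ones to the left of its coordinate. The data and the name `lookup_bitmask` are hypothetical; hardware would typically use a population-count circuit rather than a Python `sum`.

```python
# Hypothetical bitmask-compressed fiber with 8 coordinates.
mask = [0, 0, 1, 0, 0, 0, 1, 1]   # a 1 marks a non-empty coordinate
values = [5, 7, 9]                # only the non-empty payloads are stored

def lookup_bitmask(coord):
    if not mask[coord]:
        return 0                  # the coordinate is empty
    position = sum(mask[:coord])  # count the ones to the left of coord
    return values[position]
```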

¹¹ These schemes date back to very early inspiration by Shannon [279]. 

Unfortunately, compression schemes are not a panacea, as there are tradeoffs that need to 
be considered when applying them. For example, consider the number of bits 
allocated to describe gaps between non-empty values in an RLE scheme. In a very sparse fiber, 
one would like a large number of bits to express a large gap between non-zero values (or span) to 
always be able to jump directly to the next value. Otherwise, one would have to introduce zeros 
as concrete values to serve as a stepping stone to the next non-zero value. On the other hand, 
if the fiber is not so sparse, then the bits allocated to hold a large span could be a significant 
overhead. Such choices are a function of statistical properties of the data, which may or may not 
be known a priori. Thus, these choices can have a significant impact on the flexibility metrics of 
the design, as discussed in Section 3.5. Adding these kinds of compression also impacts lookup 
because there is a more complex relationship between a value’s coordinate and its position. 
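
The effect of the width of the run field can be seen in a small sketch. The encoding below, a list of (zeros_before, value) pairs with a `run_bits`-wide run field, is one assumed RLE variant chosen for illustration and is not the specific scheme used in Eyeriss; the function names are hypothetical.

```python
def rle_encode(fiber, run_bits):
    """Encode a fiber as (zeros_before, value) pairs with a bounded run field."""
    max_run = (1 << run_bits) - 1
    pairs, run = [], 0
    for value in fiber:
        if value == 0 and run < max_run:
            run += 1                      # extend the current gap
        else:
            # Either a non-zero value, or a zero stored explicitly as a
            # "stepping stone" because the run field is already full.
            pairs.append((run, value))
            run = 0
    return pairs                          # trailing zeros are implied by the shape

def rle_decode(pairs, shape):
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (shape - len(out)))  # restore any trailing zeros
    return out

fiber = [0, 0, 0, 0, 0, 0, 3, 0, 4]
wide = rle_encode(fiber, run_bits=3)      # [(6, 3), (1, 4)]
narrow = rle_encode(fiber, run_bits=2)    # [(3, 0), (2, 3), (1, 4)]
```

With a 3-bit run field the gap of six zeros fits in one pair; with a 2-bit field an explicit zero has to be stored as a stepping stone, so the narrower field costs an extra pair on this sparse fiber while it would save bits on a denser one.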

Explicit coordinate schemes can also be a compressed representation. For example, one 
could implement coordinate to payload mappings by putting the payloads in a hash table. In such 
a scheme, a coordinate could be used as the key into the hash table that returns the payload for 
that coordinate. Because hash table sizes are a function of the number of values contained (i.e., 
the non-empty coordinates), a compression benefit will accrue. Furthermore, note that within 
a rank each fiber could have its own hash table or there could be a single table associated with 
all the fibers in the entire rank. Between those two schemes the keys needed for the hash table 
lookups differ. The former scheme just needs a coordinate since each fiber has its own hash table, 
while the latter needs a fiber-id and a coordinate because the single table has the coordinates 
for multiple fibers. Therefore, depending on the hash table organization used by a rank, the 
information returned by a lookup operation on the rank above will differ. 
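
The two hash table organizations can be contrasted with Python dicts standing in for the hash tables. All names and contents below are hypothetical; the point is only that the payload held in the rank above is a table reference in one case and a fiber-id in the other.

```python
# Rank W of a hypothetical 2-D tensor (ranks H, W), stored two ways.

# (1) One hash table per fiber: the payload in rank H is a direct reference
#     to the fiber's table, and the key is just the coordinate w.
fiber_h0 = {0: "a", 2: "c"}
fiber_h2 = {1: "h"}
rank_h_refs = {0: fiber_h0, 2: fiber_h2}

# (2) One hash table for the whole rank: the payload in rank H is a fiber-id,
#     and the key is the pair (fiber-id, w).
rank_w_table = {(0, 0): "a", (0, 2): "c", (1, 1): "h"}
rank_h_ids = {0: 0, 2: 1}                 # coordinate h -> fiber-id

def lookup_per_fiber(h, w):
    table = rank_h_refs.get(h)
    return 0 if table is None else table.get(w, 0)

def lookup_per_rank(h, w):
    fiber_id = rank_h_ids.get(h)
    return 0 if fiber_id is None else rank_w_table.get((fiber_id, w), 0)
```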


8.2.3 REPRESENTATION OF PAYLOADS 


The hash table alternatives for representing the fibers in a rank described in Section 8.2.2 il- 
lustrate the fact that the information in the payloads used in a rank is dependent on the fiber 
representation used in the next rank lower in the tree. In the fiber-tree abstraction in Figure 8.19, 
this corresponds to implementation of the edges (black lines) that connect a coordinate to its 
payload. 

Like coordinates, these payloads can be explicit or implicit. In the case of the hash table 
per fiber representation, the explicit payload at the rank above in the tree is just a direct reference 
to the hash table for the appropriate fiber. Such a direct reference is a very common form of 
payload, but differs from what is needed for the hash table per rank representation, where the 
(explicit) payload is a fiber-id that is used as part of the hash table lookup. 

A payload can be implicit, such as when the position in a coordinate/payload list can be 
interpreted directly as an offset into information at the next lower rank. In that case, that position 
constitutes an implicit payload that occupies no space. 

In some cases, either coordinates, payloads, or both can be compressed by using an 
information-theoretic compression scheme, such as a Huffman code. Note, however, that this can 
introduce complexities for implicit payload schemes or complicate lookup operations. 


8.2.4 REPRESENTATION OPTIMIZATIONS 


So far, we have exclusively considered representations where each level of the tree corresponds 
to exactly one rank of the tensor. In this case, the cost of finding a value (i.e., a payload at a leaf 
of the tree) is a direct function of the number of ranks in the tensor. Fortunately, it is possible 
to consider representations with multiple ranks combined together in a level. 

Conceptually, having two ranks in a single level means combining two consecutive ranks 
into one. After this combination, the coordinates of the new rank are tuples of the coordinates 
from the two original ranks. This process is referred to as flattening ranks and is illustrated for 
a fiber-tree in Figure 8.22. 

Figure 8.22: Effect of flattening ranks. Two ranks H and W are flattened into one rank and the 
coordinates of the new rank (H,W) are tuples of the original coordinates. 

The implementation of flattening varies with the representations chosen for the individual 
ranks. For fibers represented in uncompressed array-style, it should be clear that two ranks 
can easily be flattened into a single level. The resulting representation is still uncompressed and 
lookup involves two coordinates and is implemented with a simple arithmetic calculation. For 
a 2-D tensor with ranks H and W, the calculation for lookup by coordinates h and w would 
be h × |W_rank| + w. This is a standard way that multi-dimensional arrays are represented in 
software. Flattened compressed representations are also possible, although lookup by coordinate 
might be serial. Efficient (binary search) lookup can, however, be maintained when combining 
a pair of ranks in coordinate/payload list format where the two ranks can be combined into a 
single level and the coordinates become a tuple of the coordinates from the two original ranks. 
Similarly, two ranks in hash table format can be combined into a single level with a key lookup 
by coordinate tuples. 
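
A short sketch of two of the flattened forms mentioned above, assuming hypothetical rank shapes and contents: the uncompressed case uses the h × |W| + w calculation, and the coordinate/payload case keeps (h, w) tuples in sorted order so that a binary search over the combined level still works.

```python
from bisect import bisect_left

H, W = 3, 4                               # hypothetical rank shapes

# Flattened uncompressed array: one level indexed by h * |W| + w.
flat = [0] * (H * W)
flat[0 * W + 2] = "c"
flat[2 * W + 1] = "h"

def lookup_flat(h, w):
    return flat[h * W + w]

# Flattened coordinate/payload list: coordinates are (h, w) tuples kept in
# sorted order, so binary search still works on the combined level.
coords = [(0, 2), (2, 1)]
payloads = ["c", "h"]

def lookup_flat_list(h, w):
    pos = bisect_left(coords, (h, w))
    if pos < len(coords) and coords[pos] == (h, w):
        return payloads[pos]
    return 0
```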
Table 8.1: Fiber representations 

Description                        Coordinates   Compressed   Example 
Uncompressed array                 Implicit      No           DianNao [150] 
Run-length-encoded (RLE) stream    Implicit      Yes          Eyeriss [98] 
Bitmask of non-zero coordinates    Implicit      Yes          SparTen [277] 
Coordinate/payload list            Explicit      Yes          SCNN [156] 
Hash table per fiber               Explicit      Yes 
Hash table per rank                Explicit      Yes 


8.2.5 TENSOR REPRESENTATION NOTATION 


A small table of fiber representation schemes described in the preceding sections is shown in 
Table 8.1. The table includes a short label and description for each scheme, its characteristics and 
an example design that uses it. Other (possibly flattened) fiber representations can be assigned 
their own unique labels. However, when two levels of fiber with known representations are 
flattened, we use a notation like U², P², or (RU) to describe the flattened combination. 

Given the above fiber representations, the combination of all the choices for representing 
a fiber’s coordinates and payloads leads to a large number of implementation choices. This is 
multiplied by the fact that each rank of a tensor might use a different representation. 

To provide a specific example, we will show how the well-known compressed sparse row 
(CSR) format [277] can be represented as a concrete representation of a fiber tree. Figure 8.23 
illustrates this by showing the matrix of Figure 8.18 in CSR format. The figure shows CSR as a 
two rank tree that uses an uncompressed array as its top rank fiber (H), and a coordinate/payload 
list as its bottom rank (W). Thus, the rows are compressed. 

In CSR, each position in the upper rank (which is also its coordinate since it is uncompressed) 
has a payload consisting of an open range that points at a fiber in the bottom rank.¹² 
Each fiber in the lower rank consists of a list of explicit coordinates, and the position of each 
coordinate is an implicit payload: the position of the value in the value array. 

So in the figure, if we want to find the value at coordinate (2,1) we start by looking at 
position (and coordinate) 2 of the upper rank and find its payload is the open range [2, 4). Note 
how the open range is cleverly encoded with information from two successive positions in the 
array. Then looking at the fiber in the bottom rank at positions 2 and 3 (i.e., in the open range 
[2,4)), we search for the coordinate 1. Finding that it is at position 3, we know that the value 
for coordinate (2,1) is the value “h” at position 3 in the value array. 
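
The lookup just described can be written out as follows. The three arrays are hypothetical: their contents are invented so that the numbers match the walk-through (range [2, 4) for row 2, coordinate 1 found at position 3, value "h"), and they are not copied from Figure 8.23.

```python
# Hypothetical 3-row matrix in CSR form.
row_ptr = [0, 1, 2, 4]          # rank H: positions h and h + 1 encode the range
col_coords = [2, 0, 0, 1]       # rank W: explicit coordinates, fibers concatenated
values = ["c", "e", "g", "h"]   # value array, indexed by position in rank W

def csr_lookup(h, w):
    start, end = row_ptr[h], row_ptr[h + 1]   # open range [start, end)
    for pos in range(start, end):             # search the fiber for coordinate w
        if col_coords[pos] == w:
            return values[pos]                # the position is the implicit payload
    return 0                                  # coordinate w is empty in this row

print(csr_lookup(2, 1))  # h
```

A linear scan is used here for brevity; since the coordinates within a row are sorted, a binary search over the same range would also work.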


¹² Note that the CSR representation depends on the fibers in the lower rank being consecutive in memory so the payloads 
of the upper rank can point to a position in the lower rank. 



Figure 8.23: Compressed Sparse Row (CSR) format as a concrete fiber-tree representation. CSR 
is implemented as two ranks, the top rank is an uncompressed array in memory with payloads 
consisting of a pointer to a fiber in the next rank (in open range form). The fibers in the lower rank 
are coordinate/payload lists that are concatenated together in memory. In the diagram, positions 
in memory for each element of a fiber are indicated with black numbers next to a shape. 


The CSC representation is the dual of CSR and is basically the same just with the rank 
order reversed. These schemes statically pick a representation per rank, but an even more complex 
approach would be to dynamically choose the representation used for each fiber. That choice 
could be made at the rank level, so that the rank would have a tag indicating the representation 
for all its fibers or the choice could be made at the individual fiber level. 

Noticing that many previously proposed tensor representations have been created by selecting 
a fiber representation for each rank (or flattened set of ranks) of the tensor leads to the 
idea of a more generic specification of a tensor representation. With this objective in mind, a 
full specification of a tensor would require a selection for each rank of (1) a fiber representation 
and (2) a rank-id. Combining concepts from [274] and [276] we use the following notation to 
represent the specification of a tensor: 


Tensor < FIBER-REPRESENTATION-REGEX > (RANK-ID...), (8.3) 


where FIBER-REPRESENTATION-REGEX is a regular expression consisting of a se- 
quence of labels of fiber representations from Table 8.1, and RANK-ID is the name of a rank. 
Table 8.2: Tensor representations 

Specification          Common name 
Tensor <U> (...)       Standard multi-dimensional array 
Tensor <UC> (H,W)      Compressed sparse row (CSR) [274] 
Tensor <UC> (W,H)      Compressed sparse column (CSC) [275] 
Tensor <C+> (...)      Compressed sparse fiber (CSF) [272] 
Tensor <Cⁿ> (...)      Coordinate format (COO) [274] 


Table 8.2 lists some common tensor representations in this notation and their common name. 
Note, the actual rank-ids are only used for the rows for CSR and CSC because those represen- 
tations only differ in the names they assign to the ranks. 

DNN accelerators need to make design decisions on what representations they will employ 
based on how they affect the metrics of the design. However, those decisions must 
be considered in conjunction with the computation sequencing, which is described in the next 
section. 


8.3 SPARSE DATAFLOW 


In Section 8.2, we discussed the opportunity that sparsity presents to compress the tensors used 
in DNN computation. This provides obvious savings in storage space, access energy costs, 
and data movement costs by storing and moving compressed data. However, we also recognize 
that sparsity means that individual values of activations or weights (or entire multi-dimensional 
tiles) are zero. Multiplication of anything by zero is zero, and addition with zero 
simply preserves the other input operand. Such operations therefore become ineffectual (i.e., doing 
the operation has no effect on the result). As a consequence, when performing the pervasive 
sum of product operations (i.e., dot products) in DNN computations there is an opportunity to 
exploit these ineffectual operations. 

The simplest way to exploit ineffectual operations is to skip accessing operands and avoid 
executing the multiplication when an operand is zero. The Eyeriss design saved energy by avoiding 
reading operand values and running the multiplier when an activation was zero [160]. The 
eliminated activity includes the accesses to input operands, writes/updates to output 
operands, and arithmetic computation, so it can enhance the energy efficiency/power consumption 
metrics discussed in Section 3.3. However, this only saved energy, not time, for ineffectual 
operations. 

When the hardware can recognize the zeros in the product terms, the amount of time 
spent performing the dot product can be reduced by eliminating all the time spent on ineffectual 
activity related to product terms with any zeros as an operand. This will not only improve energy, 
but also improve operations per cycle, as described in Equation (3.2). 

Figure 8.24: Distribution of weight and activation density. (Figure adapted from [159].) 

Figure 8.24 gives an indication of the potential for reducing computation time by showing 
the density (proportion of non-zeros or 1 - sparsity) in both weights and input activations for the 
layers of VGGNet. From those statistics, an architecture that could optimally exploit weight or 
activation sparsity could provide speedups of 2 to 5x. If one assumes that weight and activation 
sparsity are statistically independent, a savings would be accrued whenever at least one operand 
is zero, resulting in potential speedups of over 10x. 
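
The arithmetic behind such estimates is simple enough to state as a sketch. The function name `ideal_speedup` and the densities passed to it are hypothetical and are not values read from Figure 8.24; the estimate also ignores all hardware overheads, so it is an upper bound rather than a prediction.

```python
def ideal_speedup(weight_density, activation_density):
    """Upper bound on speedup if every product with a zero operand is skipped.

    Assumes weight and activation sparsity are statistically independent, so
    the fraction of effectual multiplications is the product of the densities.
    """
    effectual_fraction = weight_density * activation_density
    return 1.0 / effectual_fraction

# Hypothetical densities, chosen only to illustrate the calculation.
print(ideal_speedup(0.5, 1.0))   # exploiting weight sparsity only
print(ideal_speedup(0.3, 0.3))   # exploiting both operands
```

With a density of 0.3 in both operands, only 9% of the multiplications are effectual, which is how exploiting both kinds of sparsity can exceed a 10x bound when neither alone does.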

In Section 8.1.2, we noted that specialized DNN hardware is generally applied when 
trying to exploit fine-grained, unstructured sparsity in weights or any sparsity in activations. So 
this section will generally focus on techniques targeted at fine-grained, unstructured sparsity, but 
the concepts described and notation used are generally applicable to any type of sparsity. In any 
case, supporting sparsity means that additional hardware will be used to identify non-zero values 
and for looking up payloads, which adds energy and area overheads. Therefore, the challenge is 
creating an efficient design that achieves savings and does so across a range of sparsities, such that 
the benefits of exploiting sparsity exceed the cost of the additional hardware required to identify 
the non-zero data; achieving gains is particularly challenging when the amount of sparsity is low, 
fine-grained, and unstructured. 

There are two major aspects involved in the challenge of exploiting sparsity to reduce 
computation time in DNN computation: the first is choosing an optimal representation for the 
sparse data. The second is choosing a computation flow. These two aspects must be co-designed 
in order to achieve an overall optimal design. However, considering them simultaneously can 
be confusing, so we will use the fiber-tree notation presented in Section 8.2.1 to allow their 
consideration separately. 

The challenge of choosing a computation flow for a sparse DNN accelerator is somewhat 
analogous to the challenge of maximizing reuse, as discussed in Chapter 5. Just as Chapter 5 described 
the opportunities for achieving different forms of reuse, and used the notion of different 
dataflows to explore that domain of reuse opportunities, there are a variety of options and often 
tradeoffs in the amount of sparsity that can be exploited. And again dataflows can be described 
that allow one to express those options.¹³ In the next several sections, we will begin with a review 
of a dataflow for dense convolution and proceed to describe dataflows for exploiting sparse 
weights, or sparse activations and then both for convolutions. Finally, we will present dataflows 
for fully-connected computations with both sparse weights and activations. 

Design 8.4 1-D Weight-Stationary Convolution Dataflow with Dense Weights & Activations 

    1  i = Array(W)    # Uncompressed input activations 
    2  f = Array(S)    # Uncompressed filter weights 
    3  o = Array(Q)    # Uncompressed output activations 
    4 
    5  for s in [0, S): 
    6      for w in [s, Q+s):    # Q = W - S + 1, so q stays within [0, Q) 
    7          q = w - s 
    8          o[q] += i[w] * f[s] 


In Chapter 5, the different dataflows were described as different loop nests. A standard 
weight-stationary dataflow is shown as Design 8.4. In lines 1-3, the Array(...) declarations 
correspond to a tensor declared as Tensor < U” > (...) in the notation of Section 8.2.5. Lines 
5 and 6 show the for loops that create an index variable used to traverse the DNN tensors 
(weights, input and output activations). Those for loops traverse an open range of values (e.g., 
[0, S) iterates over the values 0 to S-1). Those index variables are used to directly index into 
the arrays containing the tensors. In the terminology of Section 8.2, those index variables hold 
coordinates and the arrays correspond to an uncompressed (often flattened—see Section 8.2.4) 
data representation for the tensor. Note that we follow the convention that a small letter (e.g., s) 
is a coordinate in the rank of the corresponding capital letter (e.g., S) and that same capital letter 
indicates that the coordinates occupy the open interval (e.g., [0,S)). For these uncompressed 
representations, it is easy to express payload lookups using standard array access notation (e.g., 
o[q]), since the coordinates equal the position in the array (line 8). 
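
For readers who want to run the dense loop nest, a minimal Python rendering follows. It is a sketch under two assumptions that are ours rather than the listing's: the output shape is Q = W − S + 1 (stride 1, no padding), and the inner loop is bounded as [s, Q+s) so that q always falls within [0, Q). The function name is hypothetical.

```python
def conv1d_weight_stationary(i, f):
    W, S = len(i), len(f)
    Q = W - S + 1                     # assumed output shape: stride 1, no padding
    o = [0] * Q
    for s in range(S):                # each weight f[s] is held stationary ...
        for w in range(s, Q + s):     # ... while it visits every input it touches
            q = w - s
            o[q] += i[w] * f[s]       # coordinates double as array positions
    return o

print(conv1d_weight_stationary([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]
```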

For sparse data in a compressed representation, the behavior for a particular dataflow 
(i.e., loop nest) can be replicated by replacing all the array accesses with a getPayload() lookup 
by coordinate (or set of coordinates). A weight-stationary example of this approach is shown 
in Design 8.5. In that dataflow the Tensor () declarations follow the notation of Section 8.2 
except the specific fiber representations are omitted, but would need to be selected by co-design 
with the dataflow for an actual implementation. 


¹³ Note that in Chapter 5 on dataflows, the focus was on the fundamental ordering of computations, while important issues 
such as tiling (Chapter 4) and mapping (Chapter 6) were left to be considered independently. This chapter will largely do the 
same. 

Design 8.5 1-D Weight-Stationary Convolution Dataflow with lookup of Sparse Weights & Activations 

    i = Tensor(W)    # Compressed input activations 
    f = Tensor(S)    # Compressed filter weights 
    o = Array(Q)     # Uncompressed output activations 

    for s in [0, S): 
        for w in [s, Q+s):    # Q = W - S + 1, so q stays within [0, Q) 
            q = w - s 
            o[q] += i.getPayload(w) * f.getPayload(s) 






This approach, however, has two significant drawbacks: (1) it involves iterating over all 
the coordinates in each fiber of the tensors, and therefore there will be no time or energy savings 
with respect to operand accesses; and (2) the cost of individual payload lookup given a point (i.e., 
set of coordinates) can be very expensive, since it might involve a traversal of the fiber tree - one 
level per coordinate. Fortunately, there are better dataflows that create considerable regularity in 
the access pattern, and therefore that inefficiency can be ameliorated. 
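
Both drawbacks show up if the lookup-based dataflow is instrumented. In the sketch below, `SparseFiber` is a hypothetical coordinate/payload list with a `getPayload()` method that counts its calls; the loop bounds again assume Q = W − S + 1.

```python
from bisect import bisect_left

class SparseFiber:
    """Hypothetical 1-D tensor stored as a sorted coordinate/payload list."""
    def __init__(self, dense):
        self.coords = [c for c, v in enumerate(dense) if v != 0]
        self.payloads = [v for v in dense if v != 0]
        self.lookups = 0                       # counts getPayload() calls

    def getPayload(self, coord):
        self.lookups += 1
        pos = bisect_left(self.coords, coord)  # a search on every access
        if pos < len(self.coords) and self.coords[pos] == coord:
            return self.payloads[pos]
        return 0

def conv1d_lookup(i_dense, f_dense):
    i, f = SparseFiber(i_dense), SparseFiber(f_dense)
    W, S = len(i_dense), len(f_dense)
    Q = W - S + 1                              # assumed: stride 1, no padding
    o = [0] * Q
    for s in range(S):
        for w in range(s, Q + s):
            o[w - s] += i.getPayload(w) * f.getPayload(s)
    return o, i.lookups + f.lookups

out, lookups = conv1d_lookup([0, 2, 0, 0, 3], [0, 1, 0])
print(out, lookups)  # [2, 0, 0] 18
```

The number of lookups is 2 × S × Q regardless of how many operands are zero, and each lookup pays for a search, so the compressed representation saves space here but neither time nor access energy.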

The cost of looking up a single payload in a tensor by its coordinates, which was described 
in Section 8.2, can be quite expensive. Fortunately, however, that is often not the key metric of 
interest, because it is very common for the payloads in a fiber to be traversed in coordinate order, 
i.e., in order of monotonically increasing (non-empty) coordinates. As we will see below, this 
can be true for DNN accelerators that attempt to exploit sparsity. 

Coordinate order traversal for many fiber representations, such as both the array-style 
representation and the coordinate/payload list representations, can be very efficient, often by 
employing a mechanism that remembers the current position in the fiber. For a multi-rank ten- 
sor, i.e., a tree with fibers as the payloads in the intermediate levels, a traversal of the tree in 
the order of the ranks in the tree would be correspondingly efficient. Such a traversal follows a 
depth-first traversal of the tree. We will refer to this highly desirable path through the tree as a 
concordant traversal of the tensor. Conversely, the less desirable path through the tensor where 
ranks are traveled in a different order than they appear in the tree is referred to as a discordant 
traversal.¹⁴ 
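
The difference between the two traversal orders can be illustrated on a small fiber tree. The tensor contents, the constant `W_SHAPE`, and the generator names are hypothetical, and Python dicts stand in for the fibers.

```python
# Hypothetical 2-D tensor with rank order (H, W), one dict per fiber.
t = {0: {0: "a", 2: "c"}, 2: {1: "h"}}
W_SHAPE = 3

def concordant(tensor):
    """Visit ranks in tree order (H, then W): a depth-first walk."""
    for h, fiber_w in sorted(tensor.items()):
        for w, value in sorted(fiber_w.items()):
            yield (h, w), value

def discordant(tensor):
    """Visit W in the outer loop: each step needs a fresh lookup under each h."""
    for w in range(W_SHAPE):
        for h, fiber_w in sorted(tensor.items()):
            if w in fiber_w:              # separate payload lookup per (h, w)
                yield (h, w), fiber_w[w]

print(list(concordant(t)))  # [((0, 0), 'a'), ((0, 2), 'c'), ((2, 1), 'h')]
print(list(discordant(t)))  # [((0, 0), 'a'), ((2, 1), 'h'), ((0, 2), 'c')]
```

Both walks produce the same set of points, but the concordant walk only steps forward through each fiber once, while the discordant walk probes every W fiber once per candidate coordinate w.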

In many cases, a discordant traversal would require many distinct (possibly indirect) pay- 
load lookup operations, and its cost would be a function of the details of the implementation 
(e.g., caching could help). However, some representations support reasonably efficient discor- 
dant traversal, such as a flattened uncompressed representation, which sacrifices spatial locality, 
but does not require indirect references. The balancing of the tensor representation, efficiency 
of traversal, and the application of concordant or discordant traversal are critical design considerations 
for sparse DNN accelerators. 

¹⁴ This terminology was coined by Michael Pellauer as part of the Symphony sparse computation accelerator project as 
words that have connotations that correspond to the characteristics of the traversals and have a musical allusion. 

Design 8.6 Sparse 1-D Summation 

    1  t = Tensor(H) 
    2  sum = 0 
    3 
    4  for (h, t_val) in t: 
    5      sum += t_val 


Most proposed DNN accelerators designed to exploit sparsity employ concordant traversal 
of the sparse input activation and/or weight tensors. Such a traversal generates a series of points 
(i.e., coordinate tuples) and the value at each of those points. The sequencing and use of the 
values generated by such traversals for all the tensors in a DNN computation correspond to a 
dataflow of the accelerator. Although important for understanding the ultimate design metrics 
of a design, we believe the detailed representation of the tensor is not necessary to understand 
the dataflow of the accelerators. And one can even get rough behavioral characterizations by 
just counting the lengths of each concordant traversal and number of times each type of operation 
is performed on each tensor. Therefore, we can use traversals of (and operations on) the sparse 
tensor abstraction presented in Section 8.2 to express dataflows of sparse DNN accelerators. 

Given the desire to express dataflows that operate on sparse data, we employ a Python-like 
language with iteration operators to traverse a sparse tensor. As an example, in Design 8.6 we 
illustrate summing all the elements of a sparse 1-D tensor. The first line is a declaration of a 
1-D tensor with a rank named H. Since the sum only needs to consider the non-zero elements 
of the tensor, we will express iteration over just the non-zero elements of a 1-D tensor using 
a for loop. This is shown in line 4 by a loop that iterates over tensor t. In reality, an iteration 
over a tensor implies an iteration over the elements in the topmost (and in this case, only) fiber 
of the tensor. Each step of the iteration returns a tuple consisting of the coordinate of the next 
non-empty element, h, and its payload, t_val, which for a 1-D tensor is a non-zero scalar value 
at coordinate h. Accumulating those values is performed in line 5. Assuming the concordant 
traversal of the fiber is efficient, this code represents the accumulation with efficient accesses to 
just the non-empty elements of the tensor, and its performance should be proportional to the 
number of non-empty elements in the tensor t. 
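
A runnable counterpart of the 1-D summation is sketched below. The `Fiber` class is a hypothetical stand-in for `Tensor(H)`, storing parallel coordinate and payload lists, and the accumulator is named `total` only to avoid shadowing Python's built-in `sum`.

```python
class Fiber:
    """Hypothetical fiber stored as parallel coordinate and payload lists."""
    def __init__(self, coords, payloads):
        self.coords, self.payloads = coords, payloads

    def __iter__(self):
        # Concordant traversal: step through positions in order, yielding
        # (coordinate, payload) for each non-empty element only.
        return iter(zip(self.coords, self.payloads))

    def __len__(self):
        return len(self.coords)

t = Fiber([2, 6, 7], [5, 7, 9])   # stands in for Tensor(H)

total = 0
steps = 0
for h, t_val in t:
    total += t_val
    steps += 1

print(total, steps)  # 21 3
```

The loop body runs once per non-empty element, which is the sense in which the cost tracks the occupancy of the fiber rather than the shape of the rank.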

Design 8.7 extends the above simple example to a summation of all the elements of a 2-D 
tensor, which is declared in line 1 with ranks named “H” and “W”. Now the payloads of the 
first traversal are fibers from the second rank, i.e., the W rank. Therefore the for loop in line 4 
returns a coordinate h from the H rank and a fiber (i.e., the payload at that coordinate). The fiber 
is named t_w to represent the fact that it is a fiber of the W rank of the tensor t. The for loop in 
line 5 traverses the non-empty elements of that fiber returning a coordinate w from the W rank 
and a non-zero value at the coordinate (h,w) (i.e., the payload at that coordinate). The value is 
named t_val to indicate that it is a value of the tensor t, and is combined into the sum in line 
6. 

Design 8.7 Sparse 2-D Summation 

    t = Tensor(H, W) 
    sum = 0 

    for (h, t_w) in t: 
        for (w, t_val) in t_w: 
            sum += t_val 


By considering the for loops in this dataflow, the performance can be estimated to be 
proportional to the product of the length of the fiber in the top rank (H) of t and the average 
length of the fibers in the bottom rank (W) of t. Similar performance estimates can be made 
for the other dataflows presented in this section. 
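
The performance estimate can be checked on a small example. The tensor below is hypothetical, with dicts standing in for the fibers of each rank, and the step counter is added only to compare against the estimate.

```python
# Hypothetical 2-D tensor with ranks (H, W); each fiber is a dict holding
# non-empty coordinates only.
t = {0: {0: 1, 2: 3}, 2: {1: 8, 3: 2}, 3: {0: 4, 1: 6}}

total = 0
inner_steps = 0
for h, t_w in t.items():          # traverse the single fiber of rank H
    for w, t_val in t_w.items():  # traverse one fiber of rank W
        total += t_val
        inner_steps += 1

top_len = len(t)                                          # fiber length in rank H
avg_bottom_len = sum(len(f) for f in t.values()) / top_len  # average in rank W

print(total, inner_steps, top_len * avg_bottom_len)  # 24 6 6.0
```

The number of inner iterations equals the top fiber length times the average bottom fiber length, which is also the total number of non-empty values.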

Given this notation, which focuses on the dataflow without the complexity of dealing 
with a specific tensor representation, we can succinctly express a variety of dataflows that exploit 
sparsity. The following sections explore various dataflows that are designed to exploit sparsity 
in weights and/or activations for both CONV and FC layers. In those sections, we will 
describe designs in terms of both loop nests and block diagrams whose structure is implied by the 
loop nests. 

Figure 8.25 shows a key for the components used in the design block diagrams. The block 
diagrams generally illustrate a single storage level design that processes untiled computations. 
Therefore, the storage elements are assumed to hold the entire data sets. In actual designs, 
those storage elements would be implemented using one of the hierarchical buffering schemes 
described in Section 5.8, such as caches, scratchpads, or explicit decoupled data orchestration 
(EDDO) units, like buffets. 

Table 8.3 gives a roadmap of the dataflows that will be explored. 


8.3.1 EXPLOITING SPARSE WEIGHTS 


A very natural opportunity to exploit sparsity in a DNN accelerator is to save time for zero 
weights, such as those provided by weight pruning, as described in Section 8.1.2. An advantage 
of dealing with sparse weights (as opposed to activations) is that they can be known statically and 
their compressed representation can be generated once (possibly offline in software), relieving 
the hardware of the responsibility and costs of doing that compression. 

Just as with dense computations described in Chapter 5, there are different dataflow op- 
tions for processing a convolution with sparse weights. Design 8.8 illustrates a weight-stationary 
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(a) Uncompressed data buffer: 
Buffer that accepts a coordinate via a network link (brown line) and returns via a network link
(black double line) the value (i.e., the payload) for that coordinate by reading the value at that
coordinate. (Recall that position == coordinate when the tensor is uncompressed.)

(b) Updating uncompressed data buffer: Enhanced uncompressed data buffer that also updates the
value at the specified coordinate with a new value when that value arrives. Note, these designs
wait for the update before accepting the next coordinate. Deeper pipelining optimizations are
feasible, but ignored here.

(c) Compressed data buffer: Buffer that accepts a position via a network link (orange line)
and returns both the coordinate and a value (i.e., the payload) at that coordinate. The dotted
border of the box indicates that the data is sparse.

(d) Coordinate generator (Cgen): Finite state machine configured to generate a sequence of
coordinates as well as a control signal (dashed black line) that indicates a breakpoint in a
sequence of outputs. The sequences generated by Cgen correspond to the index variables in the
for loops in a loop nest. The configuration of the coordinate generators is not expressed in
the diagrams.

(e) Latch: A one element buffer that holds a value or coordinate/value pair. There is a control
signal (dashed line) that tells when to latch a new value. A “stationary” value in a dataflow
will typically be held in a latch.

(f) Position generator (Pgen): Finite state machine programmed to operate like the Cgen, but
generates positions (orange line) instead of coordinates.

(g) Coordinate calculator: A small arithmetic unit for doing coordinate calculations, such as
simple additions and subtractions.

(h) Multiply accumulate unit (MAC): Arithmetic unit that takes in three values (double black
lines), multiplies two of the values together, adds the product to the third value, and returns
the result.

(i) Intersection unit: Unit that takes in two streams of coordinate/value pairs and generates a
stream of coordinates and two streams of values only when the coordinates from the two input
streams match.

Figure 8.25: Key for structures in block diagrams of sparse architecture dataflows.


Table 8.3: Sparse dataflow roadmap

Section | Dataflow Description                                | Examples
8.3.1   | Convolution with sparse weights                     | Cambricon-X [278]
8.3.2   | Convolution with sparse activations                 | Cnvlutin [225]
8.3.3   | Convolution with sparse weights and activations     | SCNN [156], SparTen [277], Eyeriss V2 [158]
8.3.4   | Fully connected with sparse weights and activations | EIE [279], ExTensor [273]

dataflow that exploits sparse weights. In that dataflow, the input and output feature maps are 
assumed dense and uncompressed, so they are represented as a standard Array and lookup by 
coordinate uses the standard array access operator ([]). On the other hand, the weights are
assumed sparse and compressed, so they are represented by the fiber-tree tensor abstraction and
are assumed to be held in a design-specific compressed representation that allows efficient
concordant traversal of their fibers.

Computation of the weight-stationary 1-D convolution in Design 8.8 proceeds by access- 
ing each non-zero filter weight via the for loop in line 5 and holding it stationary, while the for 
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2 


Design 8.8 1-D Weight-Stationary Convolution Dataflow - Sparse Weights

i = Array(W)    # Uncompressed input activations
f = Tensor(S)   # Compressed filter weights
o = Array(Q)    # Uncompressed output activations

for (s, f_val) in f:
    for q in [0, Q):
        w = q + s
        o[q] += i[w] * f_val





loop in line 6 traverses the output locations. Line 7 calculates the coordinate of the input
activation needed for this step using the current output coordinate, q, and the filter weight
coordinate, s. Line 8 computes the partial sum update by direct accesses to the
input activations (i[]) and partial sums (o[]) by coordinate. The filter weight itself, f_val, is
provided by the traversal of the sparse filter weight tensor.
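
The loop nest of Design 8.8 can be sketched as runnable Python. This is only an illustration under stated assumptions: the compressed filter is modeled as a list of (coordinate, value) pairs holding the non-zero weights, which is not any particular accelerator's format, and the function name and sample data are hypothetical.

```python
def conv1d_ws_sparse_weights(i, f, Q):
    """Weight-stationary 1-D convolution; f is a list of (s, f_val) pairs."""
    o = [0] * Q
    for s, f_val in f:          # traverse only the non-zero weights
        for q in range(Q):      # hold f_val stationary across all outputs
            w = q + s           # input coordinate needed for this output
            o[q] += i[w] * f_val
    return o


# Hypothetical example: W = 6, S = 3, so Q = W - S + 1 = 4.
inputs = [1, 2, 3, 4, 5, 6]
dense_filter = [2, 0, -1]
sparse_filter = [(s, v) for s, v in enumerate(dense_filter) if v != 0]
outputs = conv1d_ws_sparse_weights(inputs, sparse_filter, Q=4)
```

The zero weight at coordinate 1 never enters the outer loop, so it costs no iterations of the inner loop.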

A block diagram of the 1-D weight-stationary convolution dataflow with sparse weights
is shown in Figure 8.26. A notable feature of the diagram is the latch that holds the “stationary”
weight and its coordinate, while successive non-zero weights are blocked until the latch grabs
a new value. That is controlled by the “Next” signal from the partial sum coordinate generator,
which is sent on each of the |f[]| passes through the output partial sums.¹⁵

An advantage of the weight-stationary dataflow for sparse weights is that any complexity 
in accessing the compressed weights can be buried under the multiple iterations through the 
output activations in the inner loop. On the other hand, the output-stationary dataflow for sparse 
weights in Design 8.9 allows the hardware to accumulate partial sums into a single register, thus 
avoiding the access costs required for repeated reads and writes into a larger storage array. 

A block diagram of the 1-D output-stationary convolution dataflow with sparse weights
is shown in Figure 8.27. A notable feature is the partial sum latch, which holds partial sums
“stationary” until the final sum is sent to the partial sum buffer.

The performance of either dataflow can be estimated to be proportional to the product of the
number of output activations (Q) and the number of non-zero weights (i.e., the length of the
rank S fiber in the filter weight tensor (f)). That highlights the fact that neither dataflow provides
any time savings when an activation (as opposed to a weight) is zero.
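
As a rough check of that estimate, the output-stationary variant can be sketched in Python with a MAC counter. The (coordinate, value) list for the filter, the local variable standing in for the partial sum register, and the sample data are all assumptions made for illustration.

```python
def conv1d_os_sparse_weights(i, f, Q):
    """Output-stationary 1-D convolution; returns (outputs, MAC count)."""
    o = [0] * Q
    macs = 0
    for q in range(Q):
        acc = 0                  # models the single partial-sum register
        for s, f_val in f:       # only non-zero weights are visited
            acc += i[q + s] * f_val
            macs += 1
        o[q] = acc               # one write-back per output
    return o, macs


inputs = [1, 0, 0, 4, 0, 6]      # zeros in the activations are not skipped
sparse_filter = [(0, 2), (2, -1)]
outputs, macs = conv1d_os_sparse_weights(inputs, sparse_filter, Q=4)
```

The MAC count equals Q times the number of non-zero weights, regardless of how many input activations are zero.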


Cambricon-X 
A DNN accelerator architecture that attempts to exploit weight sparsity is Cambricon-X [281]. 
Cambricon-X is a weight-stationary dataflow for sparse weights enhanced from the above simple 


15 We are using the absolute value notation |f[]| to represent the number of valid positions (i.e., non-zero values) in the
tensor f.
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Figure 8.26: 1-D Weight-Stationary Convolution Dataflow — Sparse Weights: in this design, 
the filter weight Pgen (outer loop) is configured to generate the sequence of positions [0, |f[]|)
exactly once. The partial sum Cgen (inner loop) is configured to generate the coordinate sequence
[0, Q) one time for each non-zero filter weight (i.e., |f[]| times). The weights (f[s]) and their
coordinates (s) are held “stationary” in the latch, which is filled each time the partial sum Cgen 
starts a sequence with the “Next” control line from partial sum Cgen. That weight and the 
appropriate input activation are multiplied and added to the sequence of partial sums (o[q]), 
which are updated with the output of the MAC. 


examples in a variety of ways. One way it is enhanced is through additional parallelism, which 
is manifest by both working on multiple weights in parallel and working on multiple output 
activations in parallel. 

A sample dataflow for a 1-D convolution illustrating parallelism for weights and outputs
is shown in Design 8.10. In line 5, the weight fiber is split into subfibers of length 2 (i.e., fibers
each with 2 coordinates) by the splitEqual() method. The effect of such splitting is illustrated
in Figure 8.28, where a fiber with six sparse coordinates is split into groups of two by grabbing
coordinate/payload pairs from consecutive positions in the original fiber. Because the criterion
for splitting is based on positions, this is referred to as position-space splitting.¹⁶ After the split,
the original coordinates from the original rank (S) are preserved in the lower rank (S0), and


16 A variety of other splitting semantics exist, such as splitting in the original coordinate space and/or splitting into unequal 
pieces. 
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Design 8.9 1-D Output-Stationary Convolution Dataflow - Sparse Weights 





i = Array(W)    # Uncompressed input activations
f = Tensor(S)   # Compressed filter weights
o = Array(Q)    # Uncompressed output activations

for q in [0, Q):
    for (s, f_val) in f:
        w = q + s
        o[q] += i[w] * f_val





Design 8.10 1-D Parallel Weight-Stationary Convolution Dataflow - Sparse Weights 





 1 i = Array(W)    # Uncompressed input activations
 2 f = Tensor(S)   # Compressed filter weights
 3 o = Array(Q)    # Uncompressed output activations

 5 for (s1, f_split) in f.splitEqual(2):
 6     for q1 in [0, Q/4):
 7         parallel-for (s0, f_val) in f_split:
 8             parallel-for q0 in [0, 4):
 9                 s = s0
10                 q = q1*4 + q0
11                 w = q + s
12                 o[q] += i[w] * f_val





coordinates in the upper rank (S1) just indicate the group number. Note that there are many choices
for representing a split fiber; adding just a small amount of metadata referring to the
original fiber may be a superior design choice to adding a completely new rank and
creating new split fibers.

Returning to the dataflow, after the split a parallel-for in line 7 operates on each of
the elements in that split subfiber (f_split) in parallel. In lines 6 and 8, the indices of the
uncompressed output activations are partitioned (in position space) into groups of 4 for additional
parallel execution. Therefore, in total this dataflow provides eight-way (2 × 4) parallelism.
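
A minimal Python sketch of position-space splitting and of the Design 8.10 loop nest follows. The helper split_equal is a hypothetical stand-in for the splitEqual() method, fibers are modeled as lists of (coordinate, payload) pairs, the parallel-for loops are simply run serially, and Q is assumed to be a multiple of the output group size.

```python
def split_equal(fiber, size):
    """Position-space split: group consecutive (coord, payload) pairs."""
    return [(g, fiber[p:p + size])
            for g, p in enumerate(range(0, len(fiber), size))]


def conv1d_parallel_ws(i, f, Q, q_par=4, s_par=2):
    """Design 8.10 loop order; assumes Q is a multiple of q_par."""
    o = [0] * Q
    for s1, f_split in split_equal(f, s_par):
        for q1 in range(Q // q_par):
            # The two inner loops stand in for the parallel-for lanes.
            for s0, f_val in f_split:
                for q0 in range(q_par):
                    q = q1 * q_par + q0
                    o[q] += i[q + s0] * f_val
    return o


fiber = [(0, 1), (3, 2), (4, 3), (7, 4), (8, 5), (11, 6)]  # six non-zeros
groups = split_equal(fiber, 2)
inputs = list(range(20))
outputs = conv1d_parallel_ws(inputs, fiber, Q=8)
```

Each group keeps the original coordinates (the S0 rank), while the group index plays the role of the S1 coordinate.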

An actual sparse DNN accelerator, such as Cambricon-X, would need to consider a num- 
ber of additional factors. Factors in common with dense DNN accelerators include the need to 
expand each of the input and output operands to the full multi-dimensional tensors of standard
DNN computations (e.g., multiple input and output channels). It also needs to consider multi- 
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Figure 8.27: 1-D Output-Stationary Convolution Dataflow — Sparse Weights: in this design, 
the partial sum Cgen (outer loop) is configured to generate the coordinate sequence [0, Q) ex- 
actly one time. The filter weight Pgen (inner loop) generates the sequence of positions [0, | f []|) 
repeatedly Q times. The partial sums (o[q]) are held “stationary” in the latch, which is latched 
with the initial partial sum each time the Pgen starts a sequence with the “Next” control line 
from filter weight Pgen. That control signal also sends the final partial sum back to the buffer 
through the demultiplexer. MAC operations are performed for each non-zero weight (f[s]) and 
the appropriate input activation (i[q+s]) and added to the current partial sum o[q] creating a
new partial sum (o'[q]).


level buffering and the sizing of those buffers to accommodate various problem sizes. However, 
for sparse DNN accelerators there is an additional wrinkle, because they need to cope with the 
fact that the space used in a buffer is now a function of sparsity. Since weight sparsity is known
statically, the mapping can take those sizes into account; otherwise, conservative assumptions or
an exception-handling mechanism to cope with buffer overflows would be needed.

A final consideration is the actual data representation to be used for the sparse data. This 
can have a significant impact on the overall efficiency of a design and needs to be considered as 
a co-design with the dataflow selection. For example, the Cambricon-X team considered two
compressed, implicit-coordinate schemes for holding filter weights. They evaluated a representation
using a bit-mask indicating non-empty elements in the weight fiber and an RLE-type scheme. 
They found the RLE-type scheme more efficient for their dataflow. 
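
To make the two kinds of implicit-coordinate scheme concrete, here is a small Python sketch of a bitmask encoding and a zero-run-length encoding of one weight fiber. It is a generic illustration and should not be read as the actual Cambricon-X formats; the function names are hypothetical, and practical details such as a bounded run-length field are ignored.

```python
def to_bitmask(dense):
    """Return (mask, values): one mask bit per coordinate, non-zeros packed."""
    mask = [1 if v != 0 else 0 for v in dense]
    values = [v for v in dense if v != 0]
    return mask, values


def to_zero_runs(dense):
    """Return (run, value) pairs: run = zeros skipped before each non-zero."""
    pairs, run = [], 0
    for v in dense:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs


def from_zero_runs(pairs, length):
    """Rebuild the dense fiber from (run, value) pairs."""
    dense, pos = [0] * length, 0
    for run, v in pairs:
        pos += run               # skip the zeros preceding this value
        dense[pos] = v
        pos += 1
    return dense


weights = [0, 0, 5, 0, 7, 0, 0, 0, 9]
mask, values = to_bitmask(weights)
runs = to_zero_runs(weights)
```

In both encodings the coordinate of a value is never stored explicitly; it is recovered from the mask position or by summing the preceding runs.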
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Figure 8.28: Effect of splitting a fiber into subfibers where each subfiber contains the same 
number of coordinates, i.e., evenly. This is referred to as splitting the fiber evenly in position 
space. 


8.3.2 EXPLOITING SPARSE ACTIVATIONS 


As described in Section 8.1.1, activations can also be sparse, so it should be unsurprising that,
analogous to the dataflows that exploit sparse weights in Section 8.3.1, one can find dataflows
that exploit sparse activations.

A 1-D weight-stationary convolution dataflow that exploits sparse activations is shown in
Design 8.11. Iterating over the coordinates of the uncompressed weights in the outermost for
loop (line 5) highlights that this is a weight-stationary dataflow. In the inner for loop
(line 6), by contrast, the dataflow iterates only over the non-zero activations. Note, however, that
depending on the current weight’s coordinate, some input activations do not contribute to any output
activation, and so these edge effects are skipped with the if condition on the for statement on
line 6. Hardware would need to account for such edge effects in this design and later designs.
This hardware ideally would be able to save time for coordinates outside the constraints, but may 
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Design 8.11 1-D Weight-Stationary Convolution Dataflow - Sparse Activations 





1 i = Tensor(W)   # Compressed input activations
2 f = Array(S)    # Uncompressed filter weights
3 o = Array(Q)    # Uncompressed output activations

5 for s in [0, S):
6     for (w, i_val) in i if s <= w < Q+s:
7         q = w - s
8         o[q] += i_val * f[s]





Design 8.12 1-D Output-Stationary Convolution Dataflow - Sparse Activations 





1 i = Tensor(W)   # Compressed input activations
2 f = Array(S)    # Uncompressed filter weights
3 o = Array(Q)    # Uncompressed output activations

5 for q in [0, Q):
6     for (w, i_val) in i if q <= w < q+S:
7         s = w - q
8         o[q] += i_val * f[s]





need to spend a cycle when it goes outside the constraint. Such considerations are beyond the 
scope of this book. 

Hardware that implements this dataflow could have logic that overlaps the determination 
of skipped activations with other processing and not actually waste cycles. The designer would 
also need to pick a fiber representation for the input activations that makes traversal of the 
non-zero elements and generation of their coordinates efficient. Thus, for example, either a 
coordinate/payload or an RLE-style fiber representation could be a good choice. 
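
The weight-stationary dataflow with sparse activations, including its edge condition, can be sketched in Python as follows. The input activations are modeled as a coordinate/payload list, which is one possible choice among those mentioned above; the function name and data are hypothetical.

```python
def conv1d_ws_sparse_acts(i, f, Q):
    """i is a list of (w, i_val) non-zero activations; f is a dense list."""
    o = [0] * Q
    for s in range(len(f)):             # weight held stationary
        for w, i_val in i:              # only non-zero activations
            if s <= w < Q + s:          # skip edge cases with no valid output
                o[w - s] += i_val * f[s]
    return o


dense_inputs = [0, 3, 0, 0, 5, 0]
sparse_inputs = [(w, v) for w, v in enumerate(dense_inputs) if v != 0]
weights = [1, 2, 3]
outputs = conv1d_ws_sparse_acts(sparse_inputs, weights, Q=4)
```

In this software sketch the edge test still costs an iteration; hardware could overlap that check with other work, as discussed above.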

A block diagram of the 1-D weight-stationary convolution dataflow with sparse activa- 
tions is shown in Figure 8.29. Note that there is a latch that holds the current weight and its 
coordinate “stationary” through a series of input activations. 

An alternative dataflow for exploiting sparse activations in a 1-D convolution is shown
in Design 8.12. In the outer for loop (line 5), the dataflow iterates over all the coordinates 
of the output tensor, which confirms that this is an output-stationary dataflow. The inner for 
loop (line 6) iterates over a restricted set of the non-zero input activations that contribute to the 
current output. That restriction is represented by the condition in the if statement on line 6 
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Figure 8.29: 1-D Weight-Stationary Convolution Dataflow — Sparse Activations: in this design, 
the filter weight Cgen (outer loop) is configured to generate the coordinate sequence [0, S) 
exactly one time. The input activation Pgen (inner loop) generates the data dependent sequence 
of positions between [0, |i[]|), where s ≤ w < Q + s for each weight (i.e., S times). That Pgen
signals the beginning of each sequence on the “Next” control line to latch the current weight 
(f[s]) and its coordinate (s). MAC operations are performed for each non-zero input activation 
(i[w]) and “stationary” weight (f[s]) and added to the appropriate partial sum (o[w-s]) creating 
a new partial sum (o[w-s]), which updates the partial sum buffer. 


such that the iterator over the input activation fiber only returns the coordinates and values for 
the appropriate non-zero input activations in the fiber. 

Most of the hardware to cope with sparsity in Design 8.12 is concentrated in line 6. Here
we see the iteration over the non-zero elements of the fiber containing the input activations, i.
Thus, that fiber can be in a compressed representation (e.g., a coordinate/payload list). Line 6
also shows the restriction to input activations that contribute to the current output activation.
Careful examination of the pattern of coordinate/value pairs shows that the loop will traverse a
sliding window of input activations, where each window has a variable number of coordinates,
but will cover a constant distance in coordinate space. An example sequence of windows for a
sparse set of input activations and a 3-wide filter is shown in Figure 8.30. Again, the logic to
calculate that window can be overlapped with fetching values in the window.
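
The sliding window can be illustrated with a short Python sketch. It assumes the non-zero activations are stored as sorted coordinate and value arrays and uses a binary search to find each window; real hardware would more likely advance two pointers incrementally, so the search is only a convenience here. The function name and data are hypothetical.

```python
from bisect import bisect_left


def conv1d_os_sparse_acts(coords, values, f, Q):
    """coords/values: sorted coordinate and payload arrays of the non-zeros."""
    S = len(f)
    o = [0] * Q
    windows = []
    for q in range(Q):
        lo = bisect_left(coords, q)        # first position with w >= q
        hi = bisect_left(coords, q + S)    # first position with w >= q + S
        windows.append(coords[lo:hi])
        for p in range(lo, hi):
            o[q] += values[p] * f[coords[p] - q]
    return o, windows


coords, values = [1, 4], [3, 5]            # non-zeros of [0, 3, 0, 0, 5, 0]
outputs, windows = conv1d_os_sparse_acts(coords, values, [1, 2, 3], Q=4)
```

Each window spans exactly S coordinates but may contain any number of positions, including none.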

A block diagram of the 1-D output-stationary convolution dataflow with sparse activa- 
tions is shown in Figure 8.31. 
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Figure 8.30: Input activation sparse sliding Window — filter size S= 3. 


Cnvlutin 
The Cnvlutin sparse DNN accelerator is an augmentation of the output-stationary sparse acti- 
vation dataflow in Design 8.12 [228]. The augmentation includes processing input activations 
from multiple input channels simultaneously to allow a spatial accumulation (see Section 5.9). 
Cnvlutin also combines each input activation with weights from distinct output channel filters 
to simultaneously generate output activations for multiple output channels. It also needs to cope 
with the fact that the feature maps and filters have both a height and a width. Thus, the code 
for the dataflow for Cnvlutin would require additional for loops for the additional ranks of the 
weights, and input and output activations. 

Finally, Cnvlutin needs a representation for the input activations, and is described as hav- 
ing a relative coordinate (or offset)/payload representation. 


8.3.3 EXPLOITING SPARSE WEIGHTS AND ACTIVATIONS 


The preceding two sections (Sections 8.3.1 and 8.3.2) described dataflows that exploit sparsity
in only one of the operands of the convolution, either weights or input activations.
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Figure 8.31: 1-D Output-Stationary Convolution Dataflow — Sparse Activations: in this de- 
sign, the partial sum Cgen (outer loop) is configured to generate the coordinate sequence [0, Q) 
exactly one time. The input activation Pgen (inner loop) generates the data dependent sequence 
of positions between [0, |i[]|), where q ≤ w < q + S for each partial sum (i.e., Q times). That
Pgen signals the beginning of each sequence on the “Next” control line to move the Cgen to
the next coordinate. MAC operations are performed for each non-zero input activation (i[w])
and weight (f[w-q]) and added to the appropriate partial sum (o[q]) creating a new partial sum
(o'[q]), which updates the partial sum buffer. Note this design could easily be extended to use a
single “stationary” partial sum latch to hold the accumulating partial sum. 





Naturally, since a multiplier has to do no work when either of the input operands, weights or
input activations, is zero, it would be attractive to try to implement a dataflow that saves time
when either operand is zero.

Exploiting sparsity in both weights and activations in a dataflow is more challenging than 
exploiting sparsity in only one datatype. Therefore, some works have combined a dataflow that 
exploits sparsity in one datatype (weights) with hardware that exploits bit-level sparsity in the 
other datatype (input activations) to reduce the cost of a multiplication [283]. However, a num- 
ber of works have successfully created dataflows that exploit sparsity in both datatypes, which 
we will describe in this section. 

A 1-D input-stationary convolution dataflow that exploits both weight and input activa- 
tion sparsity is presented in Design 8.13. Here, we see that every input activation is multiplied 
by every weight and accumulated into some output activation (at coordinate q = w-s). Note that
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Design 8.13 1-D Input-Stationary Convolution Dataflow - Sparse Weights & Activations 





1 i = Tensor(W)   # Compressed input activations
2 f = Tensor(S)   # Compressed filter weights
3 o = Array(Q)    # Uncompressed output activations

5 for (w, i_val) in i:
6     for (s, f_val) in f if w-Q < s <= w:
7         q = w - s
8         o[q] += i_val * f_val





Design 8.14 1-D Sparse Input-Stationary Convolution Dataflow — Cartesian Product 





 1 i = Tensor(W)   # Compressed input activations
 2 f = Tensor(S)   # Compressed filter weights
 3 o = Array(Q)    # Uncompressed output activations

 5 for (w1, i_split) in i.splitEqual(2):
 6     for (s1, f_split) in f.splitEqual(2):
 7         parallel-for (w0, i_val) in i_split:
 8             parallel-for (s0, f_val) in f_split if w0-Q < s0 <= w0:
 9                 w = w0
10                 s = s0
11                 q = w - s
12                 o[q] += i_val * f_val





this is with the exception of edge effects that try to generate values for invalid output activation 
coordinates. Those edge effects are controlled by the if statement on line 6. 
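
A runnable Python sketch of the input-stationary dataflow of Design 8.13 is given below. Both operands are modeled as coordinate/payload lists, and a MAC counter is added to show that work is done only for pairs of non-zero values that map to a valid output. The function name and data are hypothetical.

```python
def conv1d_is_sparse(i, f, Q):
    """i and f are lists of (coordinate, value) pairs for the non-zeros."""
    o = [0] * Q
    macs = 0
    for w, i_val in i:                  # input activation held stationary
        for s, f_val in f:
            if w - Q < s <= w:          # keep only valid output coordinates
                o[w - s] += i_val * f_val
                macs += 1
    return o, macs


sparse_inputs = [(1, 3), (4, 5)]        # non-zeros of [0, 3, 0, 0, 5, 0]
sparse_filter = [(0, 1), (2, 3)]        # non-zeros of [1, 0, 3]
outputs, macs = conv1d_is_sparse(sparse_inputs, sparse_filter, Q=4)
```

A dense computation of the same problem would perform Q × S = 12 MACs; here only two are needed.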

A block diagram of the 1-D input-stationary convolution dataflow with sparse weights 
and activations is shown in Figure 8.32. 


SCNN 

The dataflow in Design 8.13 has interesting ramifications when parallelism is added. Specifically, 
consider the dataflow in Design 8.14. Following the splitting pattern from Design 8.10, we 
generate fibers with two input activations (i_split in each iteration of line 5) and two weights 
(f_split in each iteration of line 6). Those pairs of values are delivered in parallel (in lines 7 
and 8) to the multipliers in line 12. 
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Figure 8.32: 1-D Input-Stationary Convolution Dataflow — Sparse Weights and Activations: 
in this design, the input activation Pgen (outer loop) is configured to traverse the positions
of the non-zero inputs exactly once with the sequence [0, |i[]|). The filter weight Pgen
(inner loop) generates a data dependent sequence of positions [0, |f[]|), where w - Q < s ≤ w,
repeatedly for each input activation (i.e., |i[]| times). Each time it starts a sequence it signals the latch
via the “Next” control line to latch a new input. MAC operations are performed for each non-zero
input activation (i[w]) and non-zero weight (f[s]) and added to the appropriate partial sum
(o[w-s]) creating a new partial sum (o'[w-s]), which updates the partial sum buffer.


Examination of the pattern of operand delivery to line 12 reveals that two input activations
(i_val) and two weights (f_val) are delivered to the four multipliers in a cross-product
pattern. Figure 8.33 illustrates the flow of the information from the i_split and f_split fibers
to the multipliers. The attractive feature of this topology is that reads of just 2 × N values provide
operands for N² multipliers. Note that while the values must flow to the multipliers, the
coordinates must also flow to those units to calculate the coordinate of the output activation that
needs to be accumulated into. So at the right side of the diagram the resultant spray of operands
is shown. This cross (or Cartesian) product is an integral part of the SCNN design [159], which
uses clusters with 16 multipliers (i.e., N = 4).
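
One step of such a Cartesian product can be sketched in Python as follows, with N = 2 for brevity. This is an illustrative model only, not the SCNN implementation: the function name is hypothetical, and the multiplier array is modeled as a doubly nested loop that emits (output coordinate, product) updates.

```python
def cartesian_product_step(i_split, f_split, Q):
    """Return the (q, product) updates from one N x N multiplier array."""
    updates = []
    for w, i_val in i_split:            # N activations read once...
        for s, f_val in f_split:        # ...and N weights read once
            q = w - s                   # coordinates travel with the values
            if 0 <= q < Q:              # discard invalid edge outputs
                updates.append((q, i_val * f_val))
    return updates


i_split = [(2, 3), (5, 4)]              # N = 2 non-zero activations
f_split = [(0, 10), (1, 100)]           # N = 2 non-zero weights
updates = cartesian_product_step(i_split, f_split, Q=5)
```

The updates land at scattered output coordinates, which is the "spray" into the partial sum storage that the accumulation hardware has to absorb.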

The input activation and weight fibers in a 1-D convolution would undoubtedly be too
small to produce high utilization in the Cartesian product unit. So the SCNN design flattens 
(see Section 8.2.4) various ranks of both the input activations and weights. Specifically, the H 
and W ranks of the input activations are flattened (as illustrated in Figure 8.22) and the R, 
S, and M ranks of the weights are similarly flattened. This produces larger fibers that improve 
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Figure 8.33: Sparse Cartesian product. 


multiplier utilization in the sparse input-stationary Cartesian product dataflow. The tuples of 
coordinates in the flattened rank are used to compute the output activation coordinate for each 
product. In SCNN, distinct input channels (rank C) are processed serially and additional par- 
allelism is achieved by having distinct Cartesian product complexes work on different tiles of 
input activations (split in coordinate space on the H and W ranks). 

The loops in Design 8.13 could easily be reversed to create a weight-stationary dataflow;
however, as argued in the SCNN paper, that would provide no performance gain and would
result in more frequent accesses to the larger input activation storage array, resulting in lower
energy efficiency.

The energy cost associated with SCNN’s spray of outputs into the output feature map 
tensor has led to a consideration of output-stationary dataflows that exploit sparsity in both 
weights and input activations. Design 8.15 illustrates such a dataflow. The outer loop (line 5) is 
a standard traversal of the coordinates of the output feature map tensor. The inner loop (line 6),
however, shows some new tricks: intersection and projection.

The ampersand (&) operator returns the intersection of two fibers. This operator scans 
the coordinates of each of its input operands and returns a new fiber with only the coordinates 


Design 8.15 1-D Output-Stationary Convolution Dataflow - Sparse Weights & Activations 





1 i = Tensor(W)   # Compressed input activations
2 f = Tensor(S)   # Compressed filter weights
3 o = Array(Q)    # Uncompressed output activations

5 for q in [0, Q):
6     for (w, (f_val, i_val)) in f.project(+q) & i:
7         o[q] += i_val * f_val








Figure 8.34: Fiber intersection. 


that appear in both and also returns payloads that are a combination of the payloads from the 
original fibers. Figure 8.34 illustrates this action on two 1-D fibers. The intersection between the 
two fibers returns a new fiber with the common coordinates of the original fibers (2 and 5) and 
payloads that are tuples of the original fibers’ payloads ((b, f) and (c, h)). An implementation
of a fiber intersection unit for coordinate/payload-style fibers was explored in the ExTensor 
design [276]. 
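
In software, the intersection of two coordinate-sorted fibers can be sketched as a two-finger merge. The Python below is such a sketch and is not a description of the ExTensor hardware; the fibers are hypothetical, chosen only so that the shared coordinates are 2 and 5 with payloads (b, f) and (c, h) as in the example above.

```python
def intersect(a, b):
    """Two-finger intersection of fibers sorted by coordinate."""
    out, pa, pb = [], 0, 0
    while pa < len(a) and pb < len(b):
        ca, cb = a[pa][0], b[pb][0]
        if ca == cb:                    # coordinate present in both fibers
            out.append((ca, (a[pa][1], b[pb][1])))
            pa += 1
            pb += 1
        elif ca < cb:                   # advance whichever finger is behind
            pa += 1
        else:
            pb += 1
    return out


# Hypothetical fibers chosen so that coordinates 2 and 5 are shared.
fiber_a = [(0, "a"), (2, "b"), (5, "c"), (6, "d")]
fiber_b = [(1, "e"), (2, "f"), (3, "g"), (5, "h")]
common = intersect(fiber_a, fiber_b)
```

Each step advances at least one finger, so the scan takes at most len(a) + len(b) steps.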

In order for the intersection to not repeatedly act on the same elements of both fibers (f 
and i), the filter weights involved in the intersection need to move as a window over the input 
activations for each distinct partial sum. To achieve this effect, we employ a projection method 
(project(<offset>)) that shifts the coordinates in a fiber by a specified offset.¹⁷ The operation


17 Although we use a simple offset for projection here, in general one would need something more powerful, like a lambda 
function, that can more generally calculate a new coordinate. 
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Figure 8.35: Fiber projection. 


of the project() method on a 1-D fiber is illustrated in Figure 8.35, where all the coordinates
of the input fiber are shifted up by two. 
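
Projection and its use in the output-stationary dataflow of Design 8.15 can be sketched in Python as follows. The function project is a hypothetical stand-in for the project() method, fibers are coordinate/payload lists, and a dictionary lookup stands in for a hardware intersection unit; these are all simplifying assumptions.

```python
def project(fiber, offset):
    """Shift every coordinate of a fiber by offset; payloads are unchanged."""
    return [(c + offset, v) for c, v in fiber]


def conv1d_os_both_sparse(i, f, Q):
    """Output-stationary convolution using projection and intersection."""
    i_lookup = dict(i)                       # stand-in for an intersection unit
    o = [0] * Q
    for q in range(Q):
        for w, f_val in project(f, q):       # filter window slid to output q
            if w in i_lookup:                # coordinate present in both fibers
                o[q] += i_lookup[w] * f_val
    return o


shifted = project([(0, "a"), (3, "b")], 2)
sparse_inputs = [(1, 3), (4, 5)]             # non-zeros of [0, 3, 0, 0, 5, 0]
sparse_filter = [(0, 1), (2, 3)]             # non-zeros of [1, 0, 3]
outputs = conv1d_os_both_sparse(sparse_inputs, sparse_filter, Q=4)
```

Projecting the filter by q turns each weight coordinate s into the input coordinate w = s + q, so a match in the intersection identifies exactly the operand pairs that contribute to o[q].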

A block diagram of the 1-D output-stationary convolution dataflow with sparse weights
and activations is shown in Figure 8.36. That diagram shows separate streams of input
activations and (a sliding window of) filter weights being intersected to create the stream of operand
pairs for the MAC.


SparTen 

The SparTen DNN architecture uses a core dataflow similar to Design 8.15 optimized with 
specialized tensor representations for the input feature map and filter weights in order to im- 
plement intersection efficiently [280]. SparTen uses a bitmask with zeros for zero coordinates 
and ones for non-zero coordinates, so the position in the bitmask corresponds to a coordinate. 
In such a representation, the getPayload() method is implemented by counting the number 
of ones in the bitmask up to the desired coordinate to find the position of the desired payload. 
Finding intersected coordinates uses a simple Boolean AND operation on the bitmasks. Finding 
the positions of the payloads that survive the intersection is performed by counting ones in the 
original bitmask up to the desired coordinate (i.e., invoking getPayload()). 
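
The bitmask mechanics described above can be illustrated with a small Python sketch that computes a dot product of two bitmask-compressed fibers. It is not the SparTen implementation: the function names are hypothetical, Python integers serve as bitmasks with bit k standing for coordinate k, and the serial scan over coordinates is a simplification.

```python
def popcount_below(mask, coord):
    """Number of set bits in mask at bit positions lower than coord."""
    return bin(mask & ((1 << coord) - 1)).count("1")


def bitmask_dot(mask_a, vals_a, mask_b, vals_b):
    """Dot product of two bitmask-compressed fibers."""
    both = mask_a & mask_b                   # coordinates non-zero in both
    total, coord = 0, 0
    while both >> coord:
        if (both >> coord) & 1:
            pa = popcount_below(mask_a, coord)   # payload position in a
            pb = popcount_below(mask_b, coord)   # payload position in b
            total += vals_a[pa] * vals_b[pb]
        coord += 1
    return total


# Bit k of each mask is set when coordinate k holds a non-zero value.
mask_a, vals_a = 0b100110, [3, 4, 5]         # non-zeros at coordinates 1, 2, 5
mask_b, vals_b = 0b110100, [7, 8, 9]         # non-zeros at coordinates 2, 4, 5
result = bitmask_dot(mask_a, vals_a, mask_b, vals_b)
```

Only coordinates 2 and 5 survive the AND, and the prefix popcounts locate their payloads in each packed value array.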


Eyeriss V2 

As described in Chapter 3, a key challenge in the design of DNN accelerators is finding the right
balance between flexibility and other metrics, primarily energy and performance. The Eyeriss
V2 DNN accelerator strove to be efficient not only across a wide range of DNN network shapes
(see Chapters 2 and 9), but also across a range of sparsities [161]. Like the
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Figure 8.36: 1-D Output-Stationary Convolution Dataflow — Sparse Weights and Activations: 
in this design, the filter weight and input activation Pgens are each configured to traverse the
positions of the non-zero values in their respective buffers repeatedly for each partial sum (i.e., Q
times). Ideally, the input activation Pgen will only traverse the active part of the sliding window
of positions. In any case, the resultant stream of input activation coordinates and values and
the stream of filter weight (projected) coordinates and values will be intersected, generating a stream of
value pairs when the same coordinate exists in both streams. Those resulting input activations
and filter weights are multiplied together in the MAC unit and added to the current partial sum
from the partial sum buffer. The same partial sum (at coordinate q) is accumulated into until
the filter weight and input activation Pgens have both started a new sequence and informed the
partial sum Pgen (via the ANDed combination of “Next” signals) that it should move to the
next partial sum coordinate (q).


SCNN and SparTen accelerators described above, Eyeriss V2 exploits both weight and input 
activation sparsity. But in contrast to SCNN, which loses significant efficiency when the data is 
dense, Eyeriss V2 strives to provide good efficiency irrespective of sparsity. It achieves this with 
a dataflow that provides modest parallelism per PE and good utilization across a diverse set of 
workloads. 

The Eyeriss V2 dataflow is a derivative of the original Eyeriss row-stationary dataflow (see 
Section 5.7.4). Each PE handles the convolution for one row of input activations, and partial 
sums are combined in a column of PEs to create an output activation. Eyeriss V2 is extended, 
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however, to support sparse input feature maps and filter weights. The PEs also provide 2-way 
parallelism. 

The dataflow for Eyeriss V2 for performing convolution'® is shown in Design 8.16. The 
PE’s dataflow works successively on each output activation location (q) in line 5 and each filter 
weight location (s) in line 6. It uses those coordinates to get a fiber (i_c) holding the input acti- 
vations for all the input channels using the getPayload() operation in line 8. Those accesses will 
follow a sliding window pattern through the input activations that the hardware must generate. 
Then for each input channel with a non-zero activation (i.e., traversing i_c), the getPayload() 
in line 10 will access a fiber of filter weights ( f _m) containing weights for each output channel 
with a non-zero weight. The concrete representation of f is uncompressed for the “C” and “S” 
ranks, which makes that getPayload() operation efficient. The filter weight tensor is in coor- 
dinate/payload form for the “M” rank, which allows for splitting the weights into groups of two 
(line 11) and processing two weights in parallel in line 13, where each weight is multiplied by 
the same input activation and accumulated into o[m, q] in line 14. 

To improve throughput, the multiplies in the Eyeriss V2 PE are performed with two-way 
parallelism as indicated by the splitEqual() and parallel-for in lines 11 and 13. The typical 
number of filter weights for a specific input channel is generally large enough to achieve good 
utilization of the multipliers. The two-way spray of parallel accumulates (line 14) also means 
that the design requires a two-port register file for output activations. This degree of parallelism 
is a compromise between the larger amount of regular parallelism that would be available for 
dense data, and both the amount of parallelism available and the complexity of a register file 
with more ports that would be needed to support more parallelism for sparse data. 
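
As a rough software analogy of this traversal, the sketch below walks the same loop nest with nested Python dictionaries standing in for compressed fibers. The dictionary layout (`i[w][c]`, `f[c][s][m]`), the function name `eyeriss_v2_pe`, and the example values are assumptions made for illustration; they do not describe the actual Eyeriss V2 data formats or hardware.

```python
def eyeriss_v2_pe(i, f, M, Q, S):
    """Sparse 1-D convolution following the loop order of the Eyeriss V2 PE dataflow.

    i[w][c]    -> non-zero input activation (zeros are simply absent)
    f[c][s][m] -> non-zero filter weight (zeros are simply absent)
    """
    o = [[0] * Q for _ in range(M)]          # uncompressed output activations
    for q in range(Q):
        for s in range(S):
            w = q + s                        # sliding-window input coordinate
            i_c = i.get(w, {})               # fiber of non-zero channels at w
            for c, i_val in i_c.items():
                f_m = list(f.get(c, {}).get(s, {}).items())
                # Split the weight fiber into groups of two (position space).
                for k in range(0, len(f_m), 2):
                    # In hardware these (up to) two MACs would run in parallel.
                    for m, f_val in f_m[k:k + 2]:
                        o[m][q] += i_val * f_val
    return o


# Hypothetical example: C=2 input channels, W=4, S=2, Q=3, M=3.
i = {0: {0: 1}, 1: {1: 3}, 2: {0: 2}}
f = {0: {0: {0: 1, 2: 2}, 1: {2: 1}},
     1: {0: {1: 4}, 1: {0: 5}}}
o = eyeriss_v2_pe(i, f, M=3, Q=3, S=2)
```

Only coordinates that are present in both the activation and weight structures ever reach the multiply, which is how the loop nest skips ineffectual work.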


8.3.4 EXPLOITING SPARSITY IN FC LAYERS 


The previous subsections have considered the fundamental dataflows of DNN accelerators that 
targeted convolutional layers, but there also has been some work targeting FC layers. As de- 
scribed in Chapter 4, FC layers are essentially matrix-matrix or vector-matrix multiplications, 
so these accelerators target that calculation. 


EIE 

EIE is a DNN accelerator for FC layers that processes an input activation vector into an 
M-channel output activation vector [282]. A simplified representation of the EIE DNN accelerator 
for FC layers is depicted as Design 8.17. There are two things to note about this dataflow. 
First, as described in Section 8.2.4, the input channel (C), height (H), and width (W) ranks 
can be flattened into one rank (CHW). Second, recall that for FC layers the range of filter 
weights is equal to the range of input activations (i.e., CRS == CHW). So we see that for each 
input activation, this input-stationary dataflow selects a row of weights (f_m) and multiplies the 
“stationary” input activation by each of the weights in the row, and contributes the product to 
the partial sum for the appropriate output channel (m). 


¹⁸ Eyeriss V2 also supports FC layers with a different dataflow. 


Design 8.16 Eyeriss V2 Convolution Dataflow - Sparse Weights & Activations 

 1  i = Tensor(C,W)      # Compressed input activations 
 2  f = Tensor(C,S,M)    # Compressed filter weights 
 3  o = Array(M,Q)       # Uncompressed output activations 
 4 
 5  for q in [0, Q): 
 6      for s in [0, S): 
 7          w = q + s 
 8          i_c = i_w.getPayload(w) 
 9          for (c, i_val) in i_c: 
10              f_m = f_c.getPayload(c, s) 
11              f_m_split = f_m.splitEqual(2) 
12              for (_, f_m) in f_m_split: 
13                  parallel-for (m, f_val) in f_m: 
14                      o[m, q] += i_val * f_val 


Design 8.17 1-D Input-Stationary Fully Connected Dataflow - Sparse Weights & Activations 

 1  i = Tensor(CHW)      # Compressed input activations 
 2  f = Tensor(CHW, M)   # Compressed filter weights 
 3  o = Array(M)         # Uncompressed output activations 
 4 
 5  for (chw, i_val) in i: 
 6      f_m = f.getPayload(chw) 
 7      for (m, f_val) in f_m: 
 8          o[m] += i_val * f_val 
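
The input-stationary traversal of Design 8.17 can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: a dictionary holds the non-zero input activations, a list of dictionaries holds one weight fiber per CHW coordinate, and the name `fc_input_stationary` and the example values are hypothetical.

```python
def fc_input_stationary(i, f, M):
    """Sparse FC layer: each non-zero input activation is held stationary
    while it is multiplied by every non-zero weight in its row.

    i: {chw: value} for non-zero input activations
    f: list indexed by chw; f[chw] is {m: weight} for non-zero weights
    """
    o = [0] * M                              # uncompressed output activations
    for chw, i_val in i.items():             # skip zero activations entirely
        f_m = f[chw]                         # select the row of weights
        for m, f_val in f_m.items():         # skip zero weights entirely
            o[m] += i_val * f_val
    return o


# Hypothetical example with CHW = 4 and M = 3.
i = {1: 2, 3: -1}
f = [{0: 1}, {0: 3, 2: 4}, {1: 5}, {1: 2, 2: 1}]
o = fc_input_stationary(i, f, M=3)
```

Rows 0 and 2 of `f` are never read because the corresponding input activations are zero.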

In order to have efficient execution of the getPayload() method on the filter weights’ 
CHW rank and concordant traversal of the filter weights’ M rank, the fiber-tree should have a 
rank order of CHW and M. Furthermore, it is extremely likely that there is at least one weight 
for each coordinate in CHW (i.e., that rank is dense), so an uncompressed format is appropri- 
ate. Thus, EIE uses a format that is a representation for a 2-D tensor using an uncompressed 
upper rank and an RLE-style lower rank because only a fraction of the output channels will 
have a weight for any particular input activation coordinate.¹⁹ It appears that EIE also achieves 
concordant traversal of the input activations by employing an uncompressed format. 


Figure 8.37: Effect of splitting a fiber uniformly in coordinate space by 3. 

To provide parallelism, EIE partitions the M rank of the weight tensor among multiple PEs. 
The output channels (M) are split equally in position space into two ranks (M1 and M0), 
and the ranks are then reordered as M1, CHW, and M0. 

Design 8.18 shows EIE’s parallelism. Since this splitting is done on weights, which are 
known statically, the partitioning need not be done at runtime. However, note further that the 
subtrees that result from this split may not all be the same size. This can result in load imbalance, 
which can cause under-utilization of the MAC units and is a pervasive issue in sparse dataflows. 
One can also see that the same input activations’ coordinate (chw) and value (i_val) are used 
by multiple parallel units, so they must be broadcast to all the PE units. 

One final attribute of the EIE design is that it uses quantized weights (see Chapter 7). 
This could be added to this (or almost any) dataflow in a straightforward fashion by treating the 
weight (f_val) as an index into another array containing the actual weights. 
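
The indirection described above can be illustrated with a small sketch in which each stored weight is an index into a table of shared values. The table name `codebook`, the function name, and all example values are assumptions for illustration, not EIE's actual encoding.

```python
def fc_input_stationary_indexed(i, f, codebook, M):
    """Input-stationary sparse FC layer where f holds indices, not weights.

    i:        {chw: value} for non-zero input activations
    f:        list indexed by chw; f[chw] is {m: index into codebook}
    codebook: list of the actual (shared) weight values
    """
    o = [0.0] * M
    for chw, i_val in i.items():
        for m, f_idx in f[chw].items():
            o[m] += i_val * codebook[f_idx]  # look up the real weight
    return o


# Hypothetical example: three shared weight values, CHW = 2, M = 3.
codebook = [-1.0, 0.5, 2.0]
f = [{0: 2}, {1: 0, 2: 1}]
i = {0: 3.0, 1: 4.0}
o = fc_input_stationary_indexed(i, f, codebook, M=3)
```

The loop structure is unchanged; only the multiply operand is fetched through one extra lookup.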


¹⁹ The EIE paper calls this a variant of CSC, but our taxonomy would make it a distinct format. 


Design 8.18 Parallel Input-Stationary Fully Connected Dataflow - Sparse Weights & Activations 

 1  i = Tensor(CHW)          # Compressed input activations 
 2  f = Tensor(M1, CHW, M0)  # Compressed filter weights 
 3  o = Array(M)             # Uncompressed output activations 
 4 
 5  for (chw, i_val) in i: 
 6      parallel-for (m1, f_chw) in f: 
 7          f_m = f_chw.getPayload(chw) 
 8          for (m, f_val) in f_m: 
 9              o[m] += i_val * f_val 


ExTensor 


FC layers that exploit both sparse input activations and weights can also use other dataflows. 
For example, although not specifically designed for DNN acceleration, the ExTensor accelerator 
implements multi-level tiled sparse matrix multiplication [276]. Therefore, it can execute an ar- 
bitrary batch size (N) for fully connected layers. Furthermore, it selectively uses either a weight-, 
input-, or output-stationary dataflow at each level of the storage hierarchy and its performance 
is enhanced with optimized intersection units. 

Tiling the operands of a multiply requires that the corresponding coordinates exist in the 
tiles of both operands, because the arithmetic units need operands with the same coordinates. 
This involves splitting fibers in coordinate space. 

Figure 8.37 displays the effect of splitting a fiber uniformly in coordinate space. The figure 
shows the coordinates of the original fiber divided into groups of three by coordinate. 
Therefore, the groups in coordinate space are 0-2, 3-5, 6-8, and 9-11. The newly created upper 
fiber (S1) has a coordinate that matches the first coordinate of each group, and has a payload 
that is a fiber in the lower rank (S0) with the coordinate/payload pairs that existed in the original 
fiber with coordinates of that group. Note that the fibers in S0 are of different sizes. In fact, since 
there were no coordinates in the range 3-5, the corresponding fiber in S0 would be empty, and so 
there is no coordinate 3 in the upper (S1) fiber. This can lead to load imbalance between parallel units. 
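
A minimal sketch of such a split is given below, assuming a fiber is represented as a sorted list of (coordinate, payload) pairs. The function name `split_uniform_coord` and the example fiber are hypothetical and are not taken from Figure 8.37.

```python
def split_uniform_coord(fiber, step):
    """Split a fiber into groups that each span `step` coordinates.

    fiber: list of (coordinate, payload) pairs sorted by coordinate.
    Returns the upper fiber as a list of (group_start, lower_fiber) pairs;
    groups that contain no coordinates do not appear at all.
    """
    upper = {}
    for coord, payload in fiber:
        start = (coord // step) * step       # first coordinate of the group
        upper.setdefault(start, []).append((coord, payload))
    return sorted(upper.items())


# Hypothetical fiber with nothing in the coordinate range 3-5.
fiber = [(0, 'a'), (2, 'b'), (6, 'c'), (7, 'd'), (8, 'e'), (9, 'f')]
s1 = split_uniform_coord(fiber, 3)
sizes = [len(lower) for _, lower in s1]      # occupancy of each lower fiber
```

The uneven entries of `sizes` are the software counterpart of the load imbalance noted above.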

A sample weight-stationary single-tile-level ExTensor dataflow is shown in Design 8.19. 
Like the output-stationary sparse convolution dataflow (Design 8.15), this dataflow also em- 
ploys an intersection. However, note that the intersection is between the coordinates of weight 
values (f_val) and the coordinates of input activation fibers (i_n). This implies that when an 
intersection drops a coordinate, a considerable amount of work may be saved (i.e., the 
entire traversal of the input activation fiber). 


Design 8.19 ExTensor Weight-Stationary Fully Connected Dataflow - Sparse Weights & Activations 
and batch size N 

i = Tensor(CHW, N)   # Compressed input activations 
f = Tensor(M, CHW)   # Compressed filter weights 
o = Array(N, M)      # Uncompressed output activations 

for (m, f_chw) in f: 
    for (chw, (f_val, i_n)) in f_chw & i: 
        for (n, i_val) in i_n: 
            o[n, m] += i_val * f_val 
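
The role of the intersection can be made concrete with a small Python sketch. It assumes fibers are sorted lists of (coordinate, payload) pairs and uses a simple two-pointer merge as a stand-in for ExTensor's optimized intersection units; the helper names and example values are hypothetical.

```python
def intersect(a, b):
    """Yield (coord, (payload_a, payload_b)) for coordinates present in
    both sorted fibers, using a simple two-pointer merge."""
    ia, ib = 0, 0
    while ia < len(a) and ib < len(b):
        ca, cb = a[ia][0], b[ib][0]
        if ca == cb:
            yield ca, (a[ia][1], b[ib][1])
            ia += 1
            ib += 1
        elif ca < cb:
            ia += 1                          # weight with no matching input
        else:
            ib += 1                          # input with no matching weight


def fc_weight_stationary(i, f, N, M):
    """i: fiber over chw whose payloads are fibers over n (batch).
    f: fiber over m whose payloads are fibers over chw."""
    o = [[0] * M for _ in range(N)]
    for m, f_chw in f:
        for chw, (f_val, i_n) in intersect(f_chw, i):
            for n, i_val in i_n:             # weight reused across the batch
                o[n][m] += i_val * f_val
    return o


# Hypothetical example with CHW = 3, N = 2, M = 2.
i = [(0, [(0, 1), (1, 2)]), (2, [(1, 3)])]
f = [(0, [(0, 4), (1, 5)]), (1, [(2, 6)])]
o = fc_weight_stationary(i, f, N=2, M=2)
```

The weight at chw = 1 is dropped by the intersection, so no batch traversal is performed for it.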





8.3.5 SUMMARY OF SPARSE DATAFLOWS 


In this section, we have surveyed various dataflows used to exploit sparsity in both filter weights 
and input feature maps for both convolutional and FC computations. These dataflows form the 
core of a variety of DNN accelerator designs, each of which is augmented with additional layers 
of buffering and higher-level parallelism. The dataflows were represented in a uniform nota- 
tion that separated the dataflow from the details of the representation (and manipulations) of 
the input and output tensors. This allows for a comparison of the core computation flow of the 
individual designs, by illuminating the data accesses and the operations needed to implement 
each dataflow. This can be used to characterize their behavior on various workloads and to infer 
the hardware necessary to implement the dataflows. At present, however, there is no compre- 
hensive comparative analysis of the alternative sparse dataflows across a wide range of design 
parameterizations (e.g., buffer sizes) and workloads. 


8.4 SUMMARY 


This chapter explored the origins, explicit creation, and exploitation of sparsity in DNN com- 
putations, where sparsity refers to the fact that there are many repeated values, usually zeros, 
in the data. This chapter presented various sources of sparsity (e.g., the ReLU nonlinearity that 
sets negative values to zero in the feature map activations, or repeated values in the weights of 
a filter) as well as methods that can increase sparsity (e.g., exploiting correlation in the data or 
removing weights using pruning). 

Two potential architectural benefits of sparsity were also discussed: (1) compression of 
sparse data can reduce its footprint, which provides an opportunity to reduce storage require- 
ments and data movement; and (2) sparsity presents an opportunity for a reduction in MAC 
operations. The reduction in MAC operations results from the fact that 0 × anything is 0. This 
can result in savings in energy, time, or both. 

To explore the potential architectural benefits on energy-efficiency and throughput of 
sparsity, this chapter presented a sampling of sparse architectures as a composition of two 
elements: (1) a dataflow that operated on an abstract representation of a sparse tensor; and 
(2) implementation choices for the concrete data formats for the tensors. 


CHAPTER 9 


Designing Efficient DNN 
Models 


The previous two chapters discussed the use of DNN model and hardware co-design approaches, 
such as reducing precision (Chapter 7) and exploiting sparsity (Chapter 8), to reduce storage re- 
quirements, data movement, and the amount of MAC operations required for processing a DNN 
model. In this chapter, we will discuss how designing DNN models with efficient “structures,” 
often referred to as the DNN network architecture,¹ can also help enable efficient processing of 
DNNs. The network architecture is defined by the layer types, the layer shapes, the number of 
layers, and the connections between layers, as defined in Chapter 2. Designing efficient network 
architectures involves applying techniques to these different aspects to enable efficient process- 
ing. As with the other co-design approaches, the main challenge is to improve the efficiency 
of the network architecture as evaluated by the metrics described in Chapter 3, such as energy 
consumption and latency, without sacrificing the accuracy. 

In earlier works, improving the network architecture relied on the researchers’ expertise 
to manually design layers and figure out the optimal connections between them; however, this is 
often a tedious and challenging task. Consequently, in recent years, the use of machine learning 
to automatically design the network architecture, referred to as Neural Architecture Search (NAS), 
has become an increasingly popular research area. While many efficient network architecture 
design approaches focus on reducing the number of weights, activations, and MAC operations 
to improve efficiency, this does not necessarily translate to reduced energy consumption and 
latency. As previously discussed, one must also account for factors, such as utilization and the cost 
of data movement, which depend on the dataflow and memory hierarchy; in other words, one 
must consider how the DNN model is mapped onto the hardware in order to evaluate efficiency. 
Accordingly, recent research efforts have proposed methods that directly target hardware metrics 
such as energy consumption and latency when designing efficient network architectures. 

In this chapter, we will first discuss common methods that are widely used in manual 
network design. Second, we will explain how NAS can be used for network architecture design, 
describe the key components of NAS, and discuss the associated design considerations. In this 
context, we will also discuss how to bring hardware into the design loop to directly target metrics 
such as energy consumption and latency. Third, we will discuss a class of methods called 
knowledge distillation, which can be combined with network architecture design to further increase 
the accuracy. Finally, we will discuss design considerations when choosing amongst the various 
techniques for designing efficient DNN models. 


¹ Note: The term DNN network architecture differs from the hardware architecture or 
network-on-chip architecture described in the other chapters. 


Figure 9.1: Dimensions of a CONV layer. 


9.1 MANUAL NETWORK DESIGN 


The CONV and FC layers account for most of the computation and data movement of a DNN 
model. Therefore, manual network design focuses on improving the efficiency of these two types 
of layers. Manual design approaches typically focus on reducing the number of weights, 
activations, and/or MAC operations to indirectly reduce energy consumption and latency. 


9.1.1 IMPROVING EFFICIENCY OF CONV LAYERS 


Various methods have been proposed to improve the efficiency of CONV layers by reducing the 
number of weights in the filters, which may help reduce storage requirements, data movement, 
and number of MAC operations. As shown in Figure 9.1, the filters in the CONV layer are 
parameterized by the spatial size (R and S) of the filter, the number of input channels (C), 
and the number of output channels (M). Recall that the number of output channels (M) also 
corresponds to the number of filters. These methods can be categorized based on how they 
reduce each of these dimensions.² 


Figure 9.2: Approximating larger filters using smaller filters. (a) Approximating a 5x5 filter 
using 3x3 filters. Used in VGG-16. (b) Approximating a 5x5 filter using 1x5 and 5x1 filters. 
Used in GoogLeNet/Inception v3 and v4. 


The spatial size (R and S) of the filter can be reduced by replacing a single filter with a 
large spatial size (large filter) by several filters with small spatial sizes (small filters). Several small 
filters can emulate the receptive field of a large filter but with fewer operations and weights, as 
shown in Figure 9.2. For example, one 5x5 convolution can be approximated by two 3x3 convo- 
lutions [73], as shown in Figure 9.2a. Alternatively, one RxS convolution can be approximated 
by two 1-D convolutions, one 1xR and one Sx1 convolution [76], as shown in Figure 9.2b. A 
similar idea has been widely used in image processing for decades and achieved great success in 
improving algorithm efficiency when the filters are separable [137]. 
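
The savings can be checked with simple counting. The sketch below compares the number of weights per input/output channel pair for the two approximations in Figure 9.2, and confirms that the stacked small filters cover the same 5x5 receptive field. It assumes unit stride and ignores any change in the number of intermediate channels; the helper names are hypothetical.

```python
def weights(*filter_shapes):
    """Total weights of a stack of 2-D filters, each given as (R, S)."""
    return sum(r * s for r, s in filter_shapes)


def receptive_field(*filter_shapes):
    """Receptive field (height, width) of stacked stride-1 filters:
    each R x S filter widens the field by (R - 1, S - 1)."""
    height = 1 + sum(r - 1 for r, _ in filter_shapes)
    width = 1 + sum(s - 1 for _, s in filter_shapes)
    return height, width


single_5x5 = weights((5, 5))                 # 25 weights
two_3x3 = weights((3, 3), (3, 3))            # 18 weights
separable = weights((1, 5), (5, 1))          # 10 weights

rf_two_3x3 = receptive_field((3, 3), (3, 3))     # (5, 5)
rf_separable = receptive_field((1, 5), (5, 1))   # (5, 5)
```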

The number of input channels (C) can be reduced by using 1x1 CONV layer insertion and 
grouped convolutions. 1x1 CONV layer insertion [24, 74, 75] involves inserting a 1x1 CONV layer 
before a “large” CONV layer to reduce the number of input channels (C) in the “large” CONV 
layer; a “large” CONV layer typically refers to a CONV layer where R and S are greater than 
one. Specifically, the inserted 1x1 CONV layer has fewer output channels than input channels 
(M < C), which reduces the number of input channels in the next layer from C to M; this is 
often referred to as a “bottleneck,” as discussed in Section 2.4.1. For instance, Figure 9.3 shows 
how a 1x1 CONV layer with 64 input channels and 32 filters can transform an input with 
64 input channels into an output of 32 channels, which reduces the number of input channels 
in the next layer to 32. SqueezeNet is an example of a network architecture that extensively 
uses 1x1 CONV layer insertion to reduce the number of weights [284]. It proposes the use of 
a fire module that first reduces the number of input channels with a 1x1 CONV layer and then 
increases its number of output channels with multiple 1x1 and 3x3 convolution layers. This 
approach results in a network architecture that has 50x fewer weights than AlexNet, while still 
achieving the same accuracy. 


² Note that one potential downside of reducing the number of weights in a filter is that it 
reduces the number of unique filters that can be represented, which may impact accuracy. 


Figure 9.3: The 1x1 CONV layer insertion can be used to reduce the number of output channels 
(M) of the current layer, and consequently the number of input channels (C) in the next layer. 
In this figure, the number of output channels is reduced from 64 to 32 by using a 1x1 CONV 
layer with 64 input channels and 32 filters. Therefore, the next CONV layer only needs to have 
32 input channels instead of 64. 
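
To make the effect of a bottleneck concrete, the sketch below counts weights for a 3x3 CONV layer with and without a preceding 1x1 CONV layer. The 64-to-32 channel reduction matches the example of Figure 9.3; the choice of a following 3x3 layer with 64 filters, the helper name `conv_weights`, and the omission of biases are assumptions made for illustration.

```python
def conv_weights(c, m, r, s):
    """Number of weights in a CONV layer with M filters of shape C x R x S
    (biases ignored)."""
    return c * m * r * s


# A 3x3 CONV layer applied directly to 64 input channels.
direct = conv_weights(c=64, m=64, r=3, s=3)

# A 1x1 bottleneck (64 -> 32 channels) followed by the same 3x3 CONV layer,
# which now only needs 32 input channels.
bottleneck = conv_weights(64, 32, 1, 1) + conv_weights(32, 64, 3, 3)

print(direct, bottleneck, direct / bottleneck)  # 36864 20480 1.8
```

The inserted layer adds only 2,048 weights while halving the 36,864 weights of the 3x3 layer.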
Grouped convolutions (GC) divide the filters and the input channels into multiple groups. 
For conventional convolution, each of the M filters, with C channels, is applied to all C input 
channels. For GC, the M filters are divided into G groups, with M/G filters per group and 
C/G input channels per group; within each group, each of the M/G filters, with C/G channels, 
is applied to the C/G input channels within the same group. As a result, the number of channels 
in the filter is reduced by a factor of G, which leads to a G-fold reduction in the number of weights 
and MAC operations.³ Figure 9.4 gives an example of GC with G = 2 and M = 4, where the 
first half of the filters are applied to the first half of the input channels to generate the first two 
output channels, and the second half of the filters are applied to the second half of the input 
channels to generate the last two output channels. GC was first introduced by AlexNet to fit a 
layer onto two GPUs, as discussed in Section 2.4.1. 

A downside of GC is that there may be an accuracy loss due to the reduced receptive 
field in the channel dimension (see Section 2.1). The receptive field of each group is restricted to 
a subset of input channels, which prevents the cross-channel information from being fully ex- 
ploited in a similar manner as for a conventional convolution. In other words, each output chan- 
nel no longer contains information from all the input channels. This limitation can be addressed 
by adding an additional 1x1 convolution (called point-wise convolution) [183] to combine the 
information from all the input channels. Another approach is to use a shuffling operation [285] 
to shuffle the output channels across groups, such that after multiple layers each output 
channel will contain information from all input channels in the previous layers. Figure 9.5 shows an 
example of the shuffling operation for GC with G = 2 and M = 4. 
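
One common way to realize such a shuffle is to view the channels as a G x (M/G) grid, transpose it, and flatten it again. The sketch below applies that permutation to a list of channel labels. It is an assumption that this matches the intent of the operation described here; the exact permutation drawn in Figure 9.5 may differ, and the function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape, transpose, flatten).

    channels: list of per-channel items, ordered group by group.
    groups:   number of groups G; len(channels) must be divisible by G.
    """
    assert len(channels) % groups == 0
    per_group = len(channels) // groups
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]


# G = 2 and M = 4: channels 0, 1 come from group 0 and 2, 3 from group 1.
shuffled = channel_shuffle([0, 1, 2, 3], groups=2)
```

After the shuffle, each consecutive pair of channels holds one channel from each original group, so the next grouped layer sees information from both.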

Depth-wise convolution is an extreme case of GC, where G = C, as shown in Figure 9.6. 
As a result, each group has one filter and one input channel, and each filter has only one chan- 
nel and thus performs 2-D filtering. Depth-wise convolution is typically followed by point-wise 
convolution with C channels to combine the different output channels from the depth-wise con- 
volution. MobileNets [183] use the combination of depth-wise convolution followed by point- 
wise convolution to significantly reduce the number of weights and MAC operations. 
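
The reductions discussed above follow directly from counting weights. The sketch below does so for a hypothetical layer with C = 64, M = 128, and 3x3 filters; the layer shape and the helper name are assumptions for illustration, and biases are ignored. For a fixed output feature map size, the number of MAC operations scales by the same ratios.

```python
def grouped_conv_weights(c, m, r, s, g=1):
    """Weights in a CONV layer with G groups: M filters, each C/G x R x S."""
    assert c % g == 0 and m % g == 0
    return m * (c // g) * r * s


C, M, R, S = 64, 128, 3, 3

conventional = grouped_conv_weights(C, M, R, S)         # G = 1
grouped = grouped_conv_weights(C, M, R, S, g=2)         # G = 2
depthwise = grouped_conv_weights(C, C, R, S, g=C)       # G = C, one filter per channel
pointwise = grouped_conv_weights(C, M, 1, 1)            # 1x1 conv combines channels

separable = depthwise + pointwise
reduction = conventional / separable
```

For this shape, the depth-wise plus point-wise pair uses 8,768 weights instead of 73,728, a reduction of roughly 8.4x.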

Squeeze and excitation (SE) [286] can be viewed as depth-wise convolution with dynamic 
weights, where the weights change based on the input feature map rather than being fixed (static) 
after training, to increase the accuracy. The idea of SE is to put more attention on certain 
channels of the feature map (determined from training) by increasing the magnitude of their 
activations and decreasing the magnitude of activations in the other channels. Figure 9.7 
illustrates the SE operation. It first applies global pooling (see Section 9.1.2) to reduce the spatial 
resolution of the input feature map to 1 x 1. This feature map will then be processed by multiple 
FC layers with nonlinearity. The results are used as the dynamic weights of a 1 x 1 depth-wise 
convolution⁴ to process the original input feature map. 


³ Another way to think about this is that the number of input channels that contribute to each 
output channel is reduced from C to C/G. 


Figure 9.4: This figure illustrates grouped convolution with two groups (G = 2). The first half 
of the filters are applied to the first half of the input channels to generate the first two output 
channels, and the second half of the filters are applied to the second half of the input channels 
to generate the last two output channels. Note that the number of channels of filters is reduced 
from C to C/2, which leads to 2x reduction in the number of weights and MAC operations. 
Note that this example only uses a batch size of 1 (N = 1) and thus each feature map is labeled 
as “1”. For illustrative purposes, we repeat the input and output feature maps so that the reader 
can see which channels of the feature map are being processed by each filter. 

Figure 9.5: This figure illustrates the shuffling operation for grouped convolution. In this 
example, G = 2 and M = 4, where the receptive field of the first group is restricted to the first 
half of the input channels, and the receptive field of the second group is restricted to the second 
half of the input channels. By swapping the first and third output channels of the layer across 
the different groups, the output channels within each group contain information from all input 
channels. 

Figure 9.6: Depth-wise convolution is an extreme case of GC, where G = C. As a result, each 
group has one filter and one input channel, and each filter has only one channel and thus 
performs 2-D filtering. Note that this example only uses a batch size of 1 (N = 1) and thus each 
feature map is labeled as “1”. For illustrative purposes, we repeat the input and output feature 
maps so that the reader can see which channels of the feature map are being processed by each 
filter. 

Figure 9.7: Squeeze and excitation can be viewed as a 1 x 1 depth-wise convolution with 
dynamic weights and consists of three steps: (1) global pooling is applied to reduce the resolution 
of the input feature map to 1 x 1; (2) multiple FC layers with nonlinearity are used to process 
the resulting feature map; and (3) 1 x 1 depth-wise convolution, which uses the output of the FC 
layers as dynamic weights in the C 1 x 1 filters, is used to process the original input feature map. 
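
The three steps of SE can be sketched in plain Python as follows. The text above only specifies global pooling and FC layers with nonlinearity; the use of average pooling, exactly two FC layers, a ReLU after the first, and a sigmoid after the second are assumptions of this sketch, as are the function name, the omission of biases, and the example values.

```python
import math


def squeeze_excite(x, w1, w2):
    """x: feature map as [C][H][W]; w1: [C_mid][C]; w2: [C][C_mid]."""
    # (1) Squeeze: global average pooling reduces each channel to 1 x 1.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # (2) Excitation: FC + ReLU, then FC + sigmoid, giving one scale per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
         for row in w2]
    # (3) 1 x 1 depth-wise convolution with dynamic weights: scale each channel.
    return [[[s[c] * a for a in row] for row in x[c]] for c in range(len(x))]


# Hypothetical example: C = 2 channels of size 2 x 2, one hidden unit.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[1.0, 1.0]]
w2 = [[0.5], [-0.5]]
y = squeeze_excite(x, w1, w2)
```

Because the per-channel scales are computed from `x` itself, the effective 1 x 1 depth-wise weights change with every input.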

The number of output channels (M) that needs to be generated per layer can be reduced 
by reusing the output channels of output feature maps generated by previous layers, as shown in 
Figure 9.8. Reusing output channels has been shown to provide a good trade-off between accu- 
racy and efficiency. Feature map aggregation defines how output channels from previous layer(s) 
can be combined (e.g., concatenated) and reused; this is one of the key distinguishing proper- 
ties between different output channel reuse approaches. For instance, DenseNet [84] proposes 
the Dense Block where each layer concatenates the output channels from all the previous layers 
as the input. In comparison, Yu et al. [287] explore different ways to hierarchically combine 
the output channels rather than combining all the previous output channels in a single step. 

Note that output feature maps with different spatial resolutions cannot be directly com- 
bined (reused). This happens frequently in dense prediction applications, such as image seg- 
mentation. One common way to address this is to upsample the lower resolution feature map 
to match the higher resolution feature map. However, this method increases the memory re- 
quirements quadratically with respect to the up-sampling factor. Rather than up-sampling the 
low resolution feature map, the space-to-depth (S2D) operation [241] can be used to reduce the 
spatial resolution of the high-resolution feature map without losing information, as shown in 
Figure 9.9. Specifically, the S2D operation moves input activations from the spatial dimension 
to the channel dimension. After filtering the combined feature map, the high resolution can be 
restored using the depth-to-space (D2S) operation, which is the inverse of S2D, as shown in 
Figure 9.9. With the S2D and D2S operations, the spatial resolution of output feature maps can 
be changed without a substantial increase in memory requirements. 


⁴ Recall that 1 x 1 depth-wise convolution differs from conventional 1 x 1 convolution (also 
referred to as point-wise convolution) in that 1 x 1 depth-wise convolution performs 2-D 
convolution while 1 x 1 point-wise convolution performs 3-D convolution. In both cases, 1 x 1 
refers to the spatial dimension of the filter, specifically, R=1 and S=1. 


Figure 9.8: Reusing the output channels of feature maps from previous layers can help reduce 
the number of filters in a layer. 
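
A minimal sketch of the S2D and D2S operations on a feature map stored as nested lists is given below. The ordering in which spatial offsets are mapped to channels is an assumption of this sketch (other orderings are equally valid as long as D2S uses the same one), and the function names are hypothetical. The height and width are assumed to be divisible by the block size.

```python
def space_to_depth(x, b):
    """x: [C][H][W] -> [C*b*b][H//b][W//b]; no activations are lost."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = []
    for c in range(C):
        for dy in range(b):
            for dx in range(b):
                out.append([[x[c][y * b + dy][xx * b + dx]
                             for xx in range(W // b)]
                            for y in range(H // b)])
    return out


def depth_to_space(x, b):
    """Inverse of space_to_depth: [C*b*b][h][w] -> [C][h*b][w*b]."""
    Cb, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[None] * (w * b) for _ in range(h * b)]
           for _ in range(Cb // (b * b))]
    for k in range(Cb):
        c, r = divmod(k, b * b)
        dy, dx = divmod(r, b)
        for y in range(h):
            for xx in range(w):
                out[c][y * b + dy][xx * b + dx] = x[k][y][xx]
    return out


# One 4 x 4 channel becomes four 2 x 2 channels, and back again.
x = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
y = space_to_depth(x, 2)
restored = depth_to_space(y, 2)
```

The total number of activations is unchanged by either operation, which is why the memory requirements do not grow.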


9.1.2 IMPROVING EFFICIENCY OF FC LAYERS 


One of the main challenges of FC layers is the large number of weights, since each filter needs 
to have an associated weight for each activation in the feature map (i.e., R = H, S = W). As a 
result, the number of weights grows quadratically with the resolution of the input feature map 
(i.e., H x W). Therefore, the number of weights can be significantly reduced by decreasing the 
resolution of the feature map. The trend in image classification is to keep only one FC layer 
at the end of the network and insert a global pooling (also referred to as adaptive pooling) layer 
right in front of it. The global pooling layer, as shown in Figure 9.10, is the same as a regular 
pooling layer, except that its receptive field (window size) is always the same as the resolution 
of the input feature map. This layer reduces the resolution of the feature maps to 1 x 1 while 
keeping the number of channels the same and hence decreases the required number of weights 
in the following FC layer. 


Figure 9.9: An example of the space-to-depth (S2D) and depth-to-space (D2S) operations. The 
S2D operation moves input activations from the spatial dimension to the channel dimension, 
and the D2S operation is the inverse. 

Figure 9.10: The global pooling layer is the same as a regular pooling layer, except that its window 
size is always the same as the resolution of the input feature map. This layer reduces the resolution 
of the feature maps to 1 x 1 while keeping the number of channels the same. 
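
The effect of global pooling on the size of the final FC layer can be illustrated with a short calculation. The feature map shape (512 channels at 7 x 7) and the 1000 output channels are hypothetical values chosen for illustration, as is the helper name; biases are ignored.

```python
def fc_weights(c, h, w, m):
    """Weights in an FC layer with M filters, each of shape C x H x W."""
    return c * h * w * m


C, H, W, M = 512, 7, 7, 1000

without_pooling = fc_weights(C, H, W, M)     # FC layer sees the 7 x 7 map
with_pooling = fc_weights(C, 1, 1, M)        # FC layer sees the pooled 1 x 1 map
reduction = without_pooling // with_pooling  # equals H * W
```

Here the FC layer shrinks from 25,088,000 to 512,000 weights, a reduction equal to the number of spatial positions (H x W = 49).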


9.1.3 IMPROVING EFFICIENCY OF NETWORK ARCHITECTURE 
AFTER TRAINING 


The previous sections described methods of designing the network architectures before training. 
The DNN model can also be made efficient after training. Specifically, the method of approx- 
imating large filters by a series of small filters (Section 9.1.1) can be applied after training the 
weights in the DNN model through a process called tensor decomposition. It treats the weights in 
a layer as a 4-D tensor and decomposes it into a combination of smaller tensors (i.e., several lay- 
ers), which jointly approximate the original 4-D tensor. Low-rank approximation can then be 
applied to further increase the compression rate at the cost of accuracy degradation, which may 
be restored by fine-tuning the weights. This approach has been demonstrated using Canonical 
Polyadic (CP) decomposition, which is a high-order extension of singular value decomposition 
that can be solved by various methods, such as a greedy algorithm [288] or a nonlinear 
least-squares method [289]. Combining CP-decomposition with low-rank approximation can achieve 
4.5x speed-up on CPUs [289]. However, CP-decomposition cannot be computed in a numer- 
ically stable way when the dimension of the tensor, which represents the weights, is larger than 
two [289]. To alleviate this problem, Tucker decomposition can be used instead [290]. 


9.2 NEURAL ARCHITECTURE SEARCH 


Neural architecture search (NAS) aims to automatically find a network architecture that achieves 
good performance, where performance for network architecture typically refers to a good trade- 
off between accuracy and latency, or accuracy and energy. Manual network design has been 
an effective approach for designing network architecture with reasonable performance, but de- 
termining the network architecture is a tedious process. For example, there are a large num- 
ber of hyperparameters to tune, such as the number of layers, the connections between layers, 
and the type and shape of each layer. Due to the highly complex relationship between these 
hyperparameters and the performance of the resulting network, determining the optimal net- 
work architecture usually relies on trial and error, which makes network architecture design very 
complicated and time consuming. To address this problem, NAS leverages the advancement in 
machine learning to automate the design process. 

Figure 9.11 illustrates the general flow of NAS algorithms. NAS is generally carried out 
in an iterative manner. At each iteration, the optimization algorithm samples several network 
architectures (i.e., samples) from a predefined search space, which consists of all discoverable 
network architectures. The performance of each network sample will then be evaluated. Based on 
the evaluation results, the optimization algorithm samples the next set of network architectures 
from the search space. This process continues until a termination criterion (e.g., the maximum 
number of search iterations) is met and generates the searched network architecture. 

In summary, the three main components of NAS are as follows: 


• Search space: what is the set of all samples. 
• Optimization algorithm: where to sample. 
• Performance evaluation: how to evaluate samples. 
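
The interplay of these three components can be sketched with a deliberately simple example. Random search is used as a stand-in for the optimization algorithm; unlike the algorithms discussed in this chapter, it does not use the evaluation results to guide where it samples next. The search space, the scoring function `toy_score` (a made-up proxy rather than a trained network's accuracy or latency), and all names are hypothetical.

```python
import random


def nas_random_search(search_space, evaluate, num_iterations,
                      samples_per_iteration, seed=0):
    """Generic NAS loop with random search as the optimization algorithm."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(num_iterations):          # termination: max iterations
        # Optimization algorithm: decide where to sample in the search space.
        samples = [{name: rng.choice(options)
                    for name, options in search_space.items()}
                   for _ in range(samples_per_iteration)]
        # Performance evaluation: score every sampled architecture.
        for arch in samples:
            score = evaluate(arch)
            if score > best_score:
                best, best_score = arch, score
    return best, best_score


# Search space: the set of all discoverable architectures.
search_space = {"layers": [2, 4, 6, 8, 10], "filters": [16, 32, 64, 128]}


def toy_score(arch):
    """Made-up proxy for network performance, highest at 6 layers, 64 filters."""
    return -abs(arch["layers"] - 6) - abs(arch["filters"] - 64) / 32


best, best_score = nas_random_search(search_space, toy_score,
                                     num_iterations=5, samples_per_iteration=4)
```

In a real NAS flow, `evaluate` would dominate the cost because it involves training and measuring each sampled network.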


The two main metrics for gauging the performance of a NAS algorithm are (1) the achiev- 
able network performance and (2) the required search time. Ideally, the network with the best 
performance can be found using an exhaustive search, which evaluates all the possible networks 
and selects the best one. However, this is impractical given the large number of possible net- 
works. For example, exhaustively searching for the optimal network architecture with up to 10 
layers and 100 filters per layer involves evaluating 101° networks. Therefore, NAS research fo- 
cuses on reducing the search time with minimal loss in the achievable network performance. 
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Figure 9.11: The illustration of the general flow of neural architecture search. 


The search time of NAS ($\mathrm{time}_{\mathrm{NAS}}$) can be estimated by the following equation:

$$\mathrm{time}_{\mathrm{NAS}} = \mathrm{num}_{\mathrm{samples}} \times \mathrm{time}_{\mathrm{sample}}, \qquad (9.1)$$

where $\mathrm{num}_{\mathrm{samples}}$ is the number of samples explored and $\mathrm{time}_{\mathrm{sample}}$ is the time
required for evaluating a sample. Each term can be further decomposed to reveal the following factors:


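$$\mathrm{time}_{\mathrm{NAS}} \propto \left(\frac{\mathrm{size}_{\mathrm{search\_space}} \times \mathrm{num}_{\mathrm{alg\_tuning}}}{\mathrm{efficiency}_{\mathrm{alg}}}\right) \times \left(\mathrm{time}_{\mathrm{eval}} + \mathrm{time}_{\mathrm{train}}\right). \qquad (9.2)$$

The number of samples ($\mathrm{num}_{\mathrm{samples}}$) is determined by the size of the search space
($\mathrm{size}_{\mathrm{search\_space}}$), the efficiency of the optimization algorithm ($\mathrm{efficiency}_{\mathrm{alg}}$),
and the number of times it takes to tune the optimization algorithm ($\mathrm{num}_{\mathrm{alg\_tuning}}$).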

The design and size of the search space offer a trade-off between performance and search
time. A larger search space allows for more discoverable networks, which can improve
performance; however, it may require more samples to find the optimal network architecture
with the improved performance.

Table 9.1: This table summarizes which terms in Equation (9.2) can be improved by improving
each of the three main components of NAS.

The selection of the optimization algorithm also influences performance and search time.
A more efficient optimization algorithm ($\mathrm{efficiency}_{\mathrm{alg}}$) can better utilize the samples and
hence reduce the number of required samples. However, it is also important to consider the
difficulty of tuning the hyperparameters of the optimization algorithm itself (e.g., network
architecture of the reinforcement learning agent). Optimization algorithms that are difficult to
tune may require multiple iterations before they can enable effective search
($\mathrm{num}_{\mathrm{alg\_tuning}}$). This critical factor is often overlooked.

The time required for evaluating a sample ($\mathrm{time}_{\mathrm{sample}}$) includes the time required for
evaluating the network performance ($\mathrm{time}_{\mathrm{eval}}$). Once a given network is sampled, it may need
to be trained to get the precise accuracy numbers, which leads to the training time ($\mathrm{time}_{\mathrm{train}}$).
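
To make the role of each factor concrete, the following sketch evaluates Equations (9.1) and (9.2) for a few made-up settings. All numbers are hypothetical, and because Equation (9.2) is a proportionality, only the ratios between its results are meaningful.

```python
def nas_search_time(num_samples, time_eval, time_train):
    """Eq. (9.1), taking time_sample = time_eval + time_train (e.g., in hours)."""
    return num_samples * (time_eval + time_train)


def relative_search_time(size_search_space, num_alg_tuning, efficiency_alg,
                         time_eval, time_train):
    """Right-hand side of Eq. (9.2); only ratios between results are meaningful."""
    num_samples_factor = size_search_space * num_alg_tuning / efficiency_alg
    return num_samples_factor * (time_eval + time_train)


# Hypothetical baseline: 1e6 candidate networks, 3 tuning rounds, unit efficiency,
# 0.1 hours to evaluate and 2.0 hours to train each sample.
baseline = relative_search_time(1e6, 3, 1.0, time_eval=0.1, time_train=2.0)

# Shrinking the search space by 10x (Section 9.2.1).
smaller_space = relative_search_time(1e5, 3, 1.0, time_eval=0.1, time_train=2.0)

# Cutting the training time by 10x, e.g., with early termination (Section 9.2.3).
faster_eval = relative_search_time(1e6, 3, 1.0, time_eval=0.1, time_train=0.2)

print("speedup from smaller search space:", baseline / smaller_space)
print("speedup from faster evaluation:", baseline / faster_eval)
```

In this toy setting the two changes are not equally effective: reducing the training time by 10× yields less than a 10× speedup because the evaluation time is left unchanged.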

Researchers improve NAS algorithms by introducing innovations in the three main com- 
ponents, where each improves different terms in Equation (9.2) (summarized in Table 9.1): 


• Shrinking the search space, which reduces $\mathrm{size}_{\mathrm{search\_space}}$.
• Improving the optimization algorithm, which increases $\mathrm{efficiency}_{\mathrm{alg}}$ and reduces
  $\mathrm{num}_{\mathrm{alg\_tuning}}$.
• Accelerating the performance evaluation, which reduces $\mathrm{time}_{\mathrm{eval}}$ and $\mathrm{time}_{\mathrm{train}}$.


It is important to note that these three components are not independent of each other, and a 
change in one component may involve a change in another component. For example, some op- 
timization algorithms cannot support hardware metrics (e.g., latency and energy consumption) 
and thus can only use proxy metrics (e.g., number of MACs and number of weights). 


9.2.1 SHRINKING THE SEARCH SPACE 


Shrinking the search space increases the search speed by limiting the discoverable network
architectures. The idea is to search only a subset of all the possible network architectures.
Although this class of methods can effectively reduce the required number of samples, it may
irrecoverably limit the achievable network performance and needs to be carried out carefully.
For example, the optimal network architecture may fall outside of the search space, so it is
impossible for the optimization algorithm to sample it. Therefore, the domain knowledge learned
from manual network design plays an important role in properly guiding the reduction of the
search space.

Figure 9.12: Different methods for shrinking the search space and the relative sizes of the
corresponding search spaces.

Figure 9.12 illustrates different methods for shrinking the search space, and the relative 
sizes of the corresponding search spaces. The search space is determined by the possible layer 
types and the possible connections between layers. 

It is a common practice for NAS to reduce the possible layer types to a set of widely used
layer types, such as convolution with different filter sizes and strides, and pooling with different
pooling functions. These layer types have been shown to be effective in manual network design
and have become key components of modern network architectures. However, fixing the layer types
also prevents NAS from discovering new layer types.

There is a wider variety of methods for reducing the possible connections between layers.
Searching arbitrary connections provides the maximum flexibility, but it is intractable (all
connections in Figure 9.12). A few works [291, 292] add some simple constraints, such as setting
the maximum depth of the network, and show that a new network architecture can be discovered
(all connections + simple constraints in Figure 9.12). However, the resulting search space
is still too large in practice, so searching it requires a significant amount of computational
resources.

Motivated by the modular design strategy in manual network design, other methods [293-303]
first search the connections between a few layers, define them as a block, and then connect
these blocks in a predefined way (block type search + predefined connections in Figures 9.12 
and 9.13a). Because blocks contain significantly fewer layers than the whole network, the con- 
nections in blocks are easier to search. 

The search space can be further reduced by predefining the connections in the blocks
and only searching for the layer types to use for each layer [258, 304-308] (layer type search +
predefined connections in Figures 9.12 and 9.13b). With more constraints, we further reduce
the search space and hence the search time. However, more domain knowledge is required to
guarantee that there exist network architectures with good performance in the reduced search
space, and NAS becomes more similar to manual network design. Therefore, increasing the
search speed and the performance of the searched network architecture while minimizing the
required domain knowledge is a key challenge of NAS.

Figure 9.13: Illustration of two ways to shrink the search space.


9.2.2 IMPROVING THE OPTIMIZATION ALGORITHM 


Optimization algorithms for NAS differ in several aspects, such as how they use the previous
samples to determine the next set of samples. These differences lead to different computational
complexity, different restrictions on the search space and performance metrics, and different
ease of hyperparameter tuning. Popular optimization algorithms for NAS include random search,
coordinate descent, gradient descent, evolutionary algorithm, reinforcement learning, and
Bayesian optimization. Each has its own benefits and drawbacks, and which algorithm to select
depends on the target application and other factors, such as the search space and the
performance metrics used. We will briefly introduce these six optimization algorithms in their
canonical form.

Random search [296] is one of the simplest optimization algorithms for NAS. It randomly
samples the entire search space and chooses the sample with the best performance. Although
this algorithm is simple, it can find networks with performance similar to that of networks found
by more complicated optimization algorithms [302]. However, random search does not use the
samples from the previous iterations to determine the next set of samples, which reduces the
search efficiency.

Figure 9.14: An example of the coordinate descent optimization algorithm on a network with
two CONV layers and an FC layer.


In contrast, coordinate descent [258] does use the samples from the previous iterations.
At each iteration, it starts from the previous best sample and greedily samples the nearby region
along a few predefined “directions” in the search space. Searching along a “direction” can involve
sweeping a hyperparameter and keeping the other hyperparameters fixed. Figure 9.14 illustrates
an example of applying coordinate descent on a network with two CONV layers and an FC layer.
In this example, the search space is defined by the number of filters in each CONV layer, and
the two predefined directions are moving leftward (i.e., reducing the number of filters in
CONV layer 1) and downward (i.e., reducing the number of filters in CONV layer 2). After
evaluating the samples along the two directions, it adopts the sample with the best performance
and then performs the same process again. Because it samples networks along every predefined
direction at each iteration, the number of directions must be kept small.
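
The two-layer example above can be sketched in a few lines. This is a toy illustration under stated assumptions: the step size of 8 filters, the starting point of 64 filters per layer, and the scoring function `toy_score` (which stands in for measured performance and makes CONV layer 2 more sensitive to filter removal) are all hypothetical.

```python
def coordinate_descent(start, directions, score, is_valid, num_iterations):
    """Greedy search: at each iteration, step along every direction and keep the best."""
    best = start
    trajectory = [best]
    for _ in range(num_iterations):
        candidates = [move(best) for move in directions]
        candidates = [c for c in candidates if is_valid(c)]
        if not candidates:
            break  # no direction can be followed any further
        best = max(candidates, key=score)
        trajectory.append(best)
    return trajectory


STEP = 8  # hypothetical number of filters removed per step

# A sample is (filters in CONV layer 1, filters in CONV layer 2).
directions = [
    lambda s: (s[0] - STEP, s[1]),  # "leftward": shrink CONV layer 1
    lambda s: (s[0], s[1] - STEP),  # "downward": shrink CONV layer 2
]


def toy_score(sample):
    """Made-up performance: removing filters from CONV layer 2 hurts more."""
    return -(10 / sample[0] + 40 / sample[1])


def is_valid(sample):
    return min(sample) >= STEP  # every layer keeps at least STEP filters


trajectory = coordinate_descent((64, 64), directions, toy_score, is_valid, num_iterations=4)
print(trajectory)
```

Each iteration costs one evaluation per direction, which is why adding directions quickly becomes expensive.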

Gradient descent [293, 303, 305, 307, 308] also starts from the previous best sample like 
coordinate descent, but searches along the direction indicated by the gradient. The gradient can 
be computed analytically and efficiently without explicitly exploring multiple directions. How- 
ever, it requires that the gradient can be computed, which is not always feasible. For example, 
certain performance metrics (e.g., latency) are not differentiable. 

Evolutionary algorithms [292, 300, 302, 306] typically keep several samples (e.g., a thousand)
from the previous iterations. At each iteration, evolutionary algorithms randomly remove
samples with worse performance and sample the neighborhoods of the remaining networks
in the search space. Keeping multiple samples allows multiple positions in the search space to
be searched in parallel; however, it also increases the computational resource requirements and
introduces new hyperparameters, such as the number of samples to keep.
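
A minimal sketch of this idea follows. It simplifies the description above in one respect that should be noted: instead of removing worse samples randomly, it deterministically removes the `num_removed` lowest-scoring samples at each iteration. The population, the mutation rule, and `toy_score` are hypothetical.

```python
import random


def evolutionary_search(population, mutate, score, num_iterations, num_removed, seed=0):
    """Keep a population, drop the worst samples, and refill by mutating survivors."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(num_iterations):
        population.sort(key=score, reverse=True)  # best samples first
        survivors = population[:len(population) - num_removed]
        # Sample the neighborhood of randomly chosen survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in range(num_removed)]
        population = survivors + children
    return population


def mutate(sample, rng):
    """Neighbor of a sample: change the filter count of one layer by +/- 8."""
    values = list(sample)
    index = rng.randrange(len(values))
    values[index] = min(128, max(8, values[index] + rng.choice((-8, 8))))
    return tuple(values)


def toy_score(sample):
    """Made-up performance that peaks at (48, 96) filters."""
    return -(abs(sample[0] - 48) + abs(sample[1] - 96))


initial = [(8, 8), (128, 128), (64, 32), (16, 112)]
final = evolutionary_search(initial, mutate, toy_score, num_iterations=50, num_removed=2)
best = max(final, key=toy_score)
print(best, toy_score(best))
```

The population size and `num_removed` are examples of the extra hyperparameters mentioned above.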




Reinforcement learning [291, 295, 297-299] proposes a better way to use the previous 
samples. At each iteration, instead of keeping the previous best sample and starting from it, 
reinforcement learning trains an agent to learn from the previous samples and uses the agent 
to determine the next set of samples. Therefore, the new samples do not need to be in the 
neighborhood of the previous best sample. This provides higher flexibility for the search, but 
designing the agent and tuning its hyperparameters can be challenging and more complicated 
than other optimization algorithms. 

Bayesian optimization [301, 304] takes a very different route from the previously discussed 
approaches. It aims to model the distribution of the entire search space so that it can pick the 
sample with the best performance according to the distribution model. Starting from an initial 
guess of the distribution model of the search space (i.e., the prior), it updates the model using 
the samples at each iteration and generates the next set of samples based on the updated model. 
With the prior, this method can better incorporate the knowledge learned from manual network 
design. However, the main issue is that precisely modeling a large search space can be difficult 
and requires many samples. 


9.2.3 ACCELERATING THE PERFORMANCE EVALUATION 


Accelerating the performance evaluation is another effective way to speed up NAS. Because 
NAS requires only the rank of the performance values instead of the exact values, this leaves room 
for approximation. There are at least three items that can be approximated: accuracy, weights and 
hardware metrics. 

The accuracy is one of the most important metrics for network performance and can be
approximated using (1) a proxy task, (2) early termination, and (3) accuracy prediction. The proxy
task is a simpler task that can be used to approximate the target task, such as using a simpler
and smaller dataset [291, 297, 298, 300] and reducing the resolution of the images [303]. For
example, CIFAR-10 [105] is usually used as the proxy task for ImageNet [23]. Early termination
[258, 292, 295, 296, 298, 301-303] (Figure 9.15a) terminates network training before the
training converges and uses the accuracy at this point as the approximation. For instance, if a
network requires one million iterations to converge, we would only train it for 10,000 iterations.
Accuracy prediction [306] (Figure 9.15b) takes this one step further by extrapolating the
early-terminated training curve to predict the converged accuracy.

In addition to the accuracy, we can also approximate the weights to speed up training if
there exists a trained network with a network architecture similar to that of the current sampled
network. For example, if we reduce the number of filters in a layer of the trained network to
generate the current sample, we can either (1) transfer the weights or (2) estimate the weights.
For transferring weights [258, 292-294, 299, 302, 305, 307, 308] (Figure 9.16a), because the
current sample has a network architecture similar to that of a trained network, we can directly
initialize the weights of the sample by using part of the weights in the trained network. For
estimating the weights [253] (Figure 9.16b), the weights can be quickly approximated by solving an
$\ell_2$-minimization problem to minimize the difference in the output feature map between a layer in
the sample network and the corresponding layer in the trained network. Since there is a
closed-form solution for the $\ell_2$-minimization problem, this step can be carried out efficiently.

Figure 9.15: Two methods for approximating the accuracy during neural architecture search.

Figure 9.16: Two methods for approximating the weights during neural architecture search.

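
The difference between transferring and estimating weights can be illustrated on a deliberately tiny case. The sketch below is not the method of [253]; it assumes a single output of a layer modeled as a dot product over input channels, removes one input channel, and compares (1) reusing the remaining trained weights with (2) the closed-form least-squares solution obtained from the normal equations. All weights and activations are made up.

```python
def solve_linear(a, b):
    """Solve a x = b for a small square system by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]  # partial pivoting
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def estimate_weights(inputs, targets):
    """Closed-form l2 minimizer of sum_i (inputs[i] . w - targets[i])^2."""
    n = len(inputs[0])
    xtx = [[sum(x[i] * x[j] for x in inputs) for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * t for x, t in zip(inputs, targets)) for i in range(n)]
    return solve_linear(xtx, xty)  # normal equations: (X^T X) w = X^T y


def squared_error(inputs, weights, targets):
    outputs = [sum(xi * wi for xi, wi in zip(x, weights)) for x in inputs]
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


# Hypothetical trained layer with three input channels and its output feature map.
trained_weights = [0.5, -2.0, 0.25]
activations = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0],
               [2.0, 1.0, 3.0], [3.0, 2.0, 1.0]]
targets = [sum(a * w for a, w in zip(row, trained_weights)) for row in activations]

# The sampled network keeps only the first two input channels.
kept_inputs = [row[:2] for row in activations]
transferred = trained_weights[:2]                    # (1) transfer the weights
estimated = estimate_weights(kept_inputs, targets)   # (2) estimate the weights

print("error with transferred weights:", squared_error(kept_inputs, transferred, targets))
print("error with estimated weights:", squared_error(kept_inputs, estimated, targets))
```

By construction, the least-squares estimate cannot have a larger output error than the transferred weights on the data it was fit to; either result would typically serve as an initialization before further training.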
Approximating the hardware metrics (e.g., latency, energy consumption) is another important
topic if hardware metrics are involved. Evaluating hardware metrics can be slow and
difficult to parallelize due to the limited number of available hardware devices. One popular
method is using proxy metrics [305]. For example, the number of MAC operations is commonly
used for approximating the latency. However, the proxy metrics typically fail to consider
the properties of the hardware, which results in inaccurate approximation and thus inferior
performance of the searched network. Instead, building look-up tables of the hardware metrics and
performing fast table lookup during NAS can be a promising choice [258]. Figure 9.17 illustrates
an example of using layer-wise look-up tables for fast latency evaluation. These layer-wise
look-up tables contain pre-measured latency of each layer with various shapes. During NAS, we
can look up the table of each layer, and sum up the layer-wise latency to estimate the latency of
a network. This approach has been widely used with various optimization algorithms, such as
coordinate descent [258], evolutionary algorithm [306], and gradient optimization [307].

Figure 9.17: This figure illustrates how layer-wise look-up tables are used for fast hardware metric

Figure 9.18: NetAdapt is a NAS algorithm that has been shown to be an effective approach for
improving the Pareto frontier of accuracy-latency trade-off [309], requiring only a small number
of hyperparameters to tune. It is directly guided by hardware metrics and generates a family
of networks with different accuracy-latency trade-offs in a single run. The code is available
at https://netadapt.mit.edu/.
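
The look-up table approach amounts to a dictionary lookup and a sum. In the sketch below, the table keys (layer type, input channels, output channels, input feature map size) and all latency values are hypothetical; in practice the entries would be measured once on the target hardware.

```python
# Hypothetical pre-measured latencies in milliseconds, keyed by
# (layer type, input channels, output channels, input feature map size).
LATENCY_LUT_MS = {
    ("conv3x3", 3, 32, 224): 4.0,
    ("conv3x3", 32, 64, 112): 6.5,
    ("conv3x3", 32, 48, 112): 5.0,
    ("fc", 64, 10, 1): 0.5,
    ("fc", 48, 10, 1): 0.25,
}


def estimate_latency_ms(layers, lut=LATENCY_LUT_MS):
    """Estimate network latency as the sum of the looked-up layer latencies."""
    return sum(lut[layer] for layer in layers)  # KeyError if a shape was never measured


original = [("conv3x3", 3, 32, 224), ("conv3x3", 32, 64, 112), ("fc", 64, 10, 1)]
# A sample with 48 instead of 64 filters in the second CONV layer.
pruned = [("conv3x3", 3, 32, 224), ("conv3x3", 32, 48, 112), ("fc", 48, 10, 1)]

print("original:", estimate_latency_ms(original), "ms")
print("pruned:", estimate_latency_ms(pruned), "ms")
```

Summing layer-wise entries assumes that layers execute sequentially with little overlap, which is a simplification that may not hold on every platform.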


9.2.4 EXAMPLE OF NEURAL ARCHITECTURE SEARCH 


We will use NetAdapt [258] as a concrete example of a NAS algorithm. NetAdapt has been
shown to be an effective approach for improving the Pareto frontier of accuracy-latency
trade-off [309]. Figure 9.18 illustrates the algorithm flow of NetAdapt. We will now highlight the
three main components of NetAdapt.

Search space (Section 9.2.1): NetAdapt uses an initial network architecture to define the
initial search space. The search space is composed of the network architectures derived from the
initial network architecture by removing filters (i.e., reducing channels) of various layers. Since
each network architecture in the search space is a subset of the initial network architecture,
they are referred to as sub-networks. At each iteration, the search space consists of only the
sub-networks of the best performing network from the previous iteration; this effectively shrinks
the search space at each iteration.

Optimization algorithm (Section 9.2.2): NetAdapt proposes using the coordinate- 
descent optimizer, which has only a few optimizer-related hyperparameters (e.g., the target 
latency per iteration). At each iteration, the optimizer samples the sub-networks of the best 
network architecture from the previous iteration with the same target latency for this iteration. 
The sampled network architecture with the best accuracy-latency trade-off will then be used to 
generate the samples in the next iteration. The best samples from each of the iterations form 
a network family with different accuracy-latency trade-offs, which enables the support of use 
cases with different latency requirements without the need to run NetAdapt multiple times. 

Performance evaluation (Section 9.2.3): While training the sampled network architectures
to evaluate their accuracy numbers, NetAdapt uses the early termination technique to
reduce the training time. Moreover, to further improve the performance of the searched network
architecture, NetAdapt is guided by latency instead of the number of MAC operations, and
uses look-up tables to significantly speed up the performance evaluation.


9.3 KNOWLEDGE DISTILLATION 


Using a DNN model with many layers or averaging the predictions of different models (i.e.,
ensemble) typically gives a better accuracy than using a single DNN model with only a few layers.
However, the computational complexity is also higher. To get the best of both worlds, a method
called knowledge distillation is used to transfer the knowledge learned by one or more complex
DNN models (teacher) to a simpler DNN model (student) to increase the accuracy of the student
network without increasing its complexity. More precisely, knowledge distillation involves
first training the complicated teacher network and then using the outputs of the teacher network
as the labels to train the simpler student network. The student network can therefore achieve
an accuracy that would not be achievable if it were directly trained with the labels in the same
dataset [310, 311]. For example, Hinton et al. [312] show how using knowledge distillation
can improve the speech recognition accuracy of a student network by 2%, making it similar to the
accuracy of a teacher network that is composed of an ensemble of ten networks.

Figure 9.19 shows the simplest knowledge distillation method [310]. The softmax layer
is commonly used as the output layer in networks for image classification to generate the class
probabilities from the class scores;⁵ it normalizes the class scores into values between 0 and
1 that sum up to 1, where large and small values are pushed toward the two extremes, 1 and
0, respectively. For this knowledge distillation method, the class scores of the teacher network
are used as the targets and the objective is to minimize the difference between the targets and
the class scores of the student network. Class scores are used as the targets instead of the class
probabilities because the softmax layer eliminates the important information contained in the
small class scores by pushing the corresponding class probabilities toward 0. Alternatively, if the
softmax is configured to generate a smoother class probability distribution that better preserves
the small class scores, the class probabilities can be used as targets [312]. Finally, the
intermediate representations of the teacher network can also be incorporated as extra hints to
train the student network [313].

Figure 9.19: Knowledge distillation matches the class scores of a simple network (student) to
those of a complex network (teacher).
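
The two kinds of targets can be compared with a small numerical sketch. The class scores below are made up, the squared-error objective is one possible choice for "minimizing the difference" between class scores, and the `temperature` parameter is an assumption about how the softmax is made smoother (it is the mechanism used in [312]).

```python
import math


def softmax(scores, temperature=1.0):
    """Class probabilities from class scores; a higher temperature gives a smoother output."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def score_matching_loss(student_scores, teacher_scores):
    """Mean squared difference between student and teacher class scores."""
    diffs = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(diffs) / len(diffs)


def soft_target_loss(student_scores, teacher_scores, temperature):
    """Cross-entropy between the smoothed teacher and student class probabilities."""
    teacher_probs = softmax(teacher_scores, temperature)
    student_probs = softmax(student_scores, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher_scores = [8.0, 2.0, -1.0]  # hypothetical class scores of the teacher
student_scores = [5.0, 3.0, 0.0]   # hypothetical class scores of the student

print("teacher probabilities, T=1:", softmax(teacher_scores))
print("teacher probabilities, T=4:", softmax(teacher_scores, temperature=4.0))
print("score matching loss:", score_matching_loss(student_scores, teacher_scores))
print("soft target loss, T=4:", soft_target_loss(student_scores, teacher_scores, 4.0))
```

With a temperature of 1, the probabilities of the two smaller classes are close to 0, which is the loss of information described above; a higher temperature keeps them distinguishable.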

Knowledge distillation can also be combined with the co-design methods described in
the previous chapters to improve the accuracy-efficiency trade-off. For instance, Apprentice [314]
combines knowledge distillation with reduced precision by using a large teacher network (e.g.,
ResNet-101) with full precision to train a smaller student network (e.g., ResNet-50) with
reduced precision. Knowledge distillation increases the accuracy of the reduced precision network
by 1.5% to 3%, which helps close the gap between reduced precision and full precision networks.


5 Also commonly referred to as logits. 


9.4 DESIGN CONSIDERATIONS FOR EFFICIENT DNN MODELS


There are several important factors to consider when applying techniques to design efficient 
network architectures for DNN models. 

First and foremost is the impact on the accuracy. This is usually difficult to evaluate be- 
cause the impact may vary from application to application and dataset to dataset. For example, 
reducing the spatial resolution of feature maps may cause less accuracy degradation for image 
classification compared to image segmentation, which requires high spatial resolution for ac- 
curate per-pixel labels. Moreover, the accuracy highly depends on the training procedure and 
hyperparameters, such as the learning rate, the usage of knowledge distillation, the degree of 
regularization, the data pre-processing, and the selection of optimization algorithm for train- 
ing. Therefore, when comparing different efficient DNN model design approaches, the impact 
on accuracy should be evaluated on the same target application and dataset with the same train- 
ing procedure and hyperparameters. 

Second, it is important to correctly evaluate the hardware metrics, such as latency and
energy consumption. While the number of weights and MAC operations provide an easy and
quick approximation to the hardware metrics, a reduction in the number of weights and MAC
operations may not directly translate into a proportional reduction in latency or energy
consumption. In addition, different hardware platforms have different characteristics. For example,
feature map movement in the memory hierarchy can consume a significant amount of energy
on some hardware platforms, such as the processing-in-memory accelerators discussed
in Section 10.2. If the goal is to minimize energy consumption on this type of hardware, the
number of MAC operations is not a good approximation because it does not consider the memory
hierarchy and data reuse. Therefore, directly measuring the target metric on the target hardware
instead of using proxy metrics can enable more informed design decisions. For instance,
processing-in-memory accelerators may prefer network architectures with larger filters and fewer
layers to reduce feature map movement [315], which differs from conventional digital accelerators;
however, solely evaluating the number of MAC operations would not provide this insight.

Third, the design time and effort for applying a given technique to achieve the desired
performance should be considered. This can include the time required for tuning the
hyperparameters, as explained in Section 9.2. More hyperparameters and a higher degree of
uncertainty in the relationship between the hyperparameters and their impact on performance can
increase the time it takes to tune the DNN model. Therefore, the ease of use of a given technique
is an important consideration, but unfortunately it is often overlooked.

Finally, some techniques may require additional hardware in order to realize the potential
latency and energy consumption benefits. For instance, efficient network architectures may
require the use of a more diverse range of layer shapes, which means a hardware accelerator may
require additional hardware to provide the flexibility to support these shapes. However, given
a fixed area budget, any extra hardware overhead would result in a reduction in other hardware
resources (e.g., on-chip storage or number of PEs), which can in turn degrade performance.
Therefore, one should consider whether the extra hardware cost exceeds or significantly
degrades the overall benefits of a given approach.


CHAPTER 10


Advanced Technologies 


As highlighted throughout the previous chapters, data movement dominates energy consumption.
The energy is consumed both in the access to the memory as well as the transfer of the
data. The associated physical factors also limit the bandwidth available to deliver data between
memory and compute, and thus limit the throughput of the overall system. This is commonly
referred to by computer architects as the “memory wall.”¹

To address the challenges associated with data movement, there have been various efforts 
to bring compute and memory closer together. Chapters 5 and 6 primarily focus on how to 
design spatial architectures that distribute the on-chip memory closer to the computation (e.g., 
scratch pad memory in the PE). This chapter will describe various other architectures that use 
advanced memory, process, and fabrication technologies to bring the compute and memory together. 

First, we will describe efforts to bring the off-chip high-density memory (e.g., DRAM) 
closer to the computation. These approaches are often referred to as processing near memory or
near-data processing, and include memory technologies such as embedded DRAM and 3-D 
stacked DRAM. 

Next, we will describe efforts to integrate the computation into the memory itself. These
approaches are often referred to as processing in memory or in-memory computing, and include 
memory technologies such as Static Random Access Memories (SRAM), Dynamic Random 
Access Memories (DRAM), and emerging non-volatile memory (NVM). Since these ap- 
proaches rely on mixed-signal circuit design to enable processing in the analog domain, we will 
also discuss the design challenges related to handling the increased sensitivity to circuit and de- 
vice non-idealities (e.g., nonlinearity, process and temperature variations), as well as the impact 
on area density, which is critical for memory. 

Significant data movement also occurs between the sensor that collects the data and the 
DNN processor. The same principles that are used to bring compute near the memory, where 
the weights are stored, can be used to bring the compute near the sensor, where the input data is 
collected. Therefore, we will also discuss how to integrate some of the compute into the sensor.

Finally, since photons travel much faster than electrons and the cost of moving a photon 
can be independent of distance, processing in the optical domain using light may provide signifi- 
cant improvements in energy efficiency and throughput over the electrical domain. Accordingly, 
we will conclude this chapter by discussing the recent work that performs DNN processing in 
the optical domain, referred to as Optical Neural Networks. 


1 Specifically, the memory wall refers to data moving between the off-chip memory (e.g., DRAM) and the processor. 

Table 10.1: Example of recent works that explore processing near memory. For I/O, TSV refers
to through-silicon vias, while TCI refers to ThruChip Interface, which uses inductive coupling.
For bandwidth, ch refers to the number of parallel communication channels, which can be the
number of tiles (for eDRAM) or the number of vaults (for stacked memory). The size of stacked
DRAM is based on Hybrid Memory Cube (HMC) Gen2 specifications.

Work               Technology       Bandwidth                          Evaluation
DaDianNao [149]    eDRAM            18 ch × 310 GB/s = 5580 GB/s       Simulated
Neurocube [313]    Stacked DRAM     16 ch × 10 GB/s = 160 GB/s         Simulated
Tetris [314]       Stacked DRAM     16 ch × 8 GB/s = 128 GB/s          Simulated
Quest [315]        Stacked SRAM     24 ch × 1.2 GB/s = 28.8 GB/s       Measured
N3XT [316]         Monolithic 3-D   16 ch × 48 GB/s = 768 GB/s         Simulated


10.1 PROCESSING NEAR MEMORY 


High-density memories typically require a different process technology than processors and are
therefore often fabricated as separate chips; as a result, accessing high-density memories requires
going off-chip. The bandwidth and energy cost of accessing high-density off-chip memories
are often limited by the number of I/O pads per chip and the off-chip interconnect channel 
characteristics (i.e., its resistance, inductance, and capacitance). Processing near memory aims 
to overcome these limitations by bringing the compute near the high-density memory to reduce 
access energy and increase memory bandwidth. The reduction in access energy is achieved by 
reducing the length of the interconnect between the memory and compute, while the increase in 
bandwidth is primarily enabled by increasing the number of bits that can be accessed per cycle 
by allowing for a wider interconnect and, to a lesser extent, by increasing the clock frequency, 
which is made possible by the reduced interconnect length. 

Various recent advanced memory technologies aim to enable processing near memory with 
differing integration costs. Table 10.1 summarizes some of these efforts, where high-density 
memories on the order of tens of megabytes to gigabytes are connected to the compute engine 
at bandwidths of tens to hundreds of gigabytes per second. Note that currently most academic 
evaluations of DNN systems using advanced memory technologies have been based on simula- 
tions rather than fabrication and measurements. 
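
The aggregate bandwidths in Table 10.1 follow directly from the number of parallel channels and the per-channel bandwidth. The short sketch below recomputes them from the table entries; the dictionary layout and function name are only illustrative.

```python
# (number of channels, bandwidth per channel in GB/s), taken from Table 10.1.
DESIGNS = {
    "DaDianNao": (18, 310.0),
    "Neurocube": (16, 10.0),
    "Tetris": (16, 8.0),
    "Quest": (24, 1.2),
    "N3XT": (16, 48.0),
}


def aggregate_bandwidth_gb_per_s(num_channels, per_channel_gb_per_s):
    """Total bandwidth when all channels (tiles or vaults) are accessed in parallel."""
    return num_channels * per_channel_gb_per_s


for name, (channels, per_channel) in DESIGNS.items():
    total = aggregate_bandwidth_gb_per_s(channels, per_channel)
    print(f"{name}: {channels} ch x {per_channel} GB/s = {total:.1f} GB/s")
```

The totals are peak values that assume every channel is kept busy, which depends on how the data is allocated across tiles or vaults.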

In this section, we will describe the cost and benefits of each technology and provide 
examples of how they have been used to process DNNs. The architectural design challenges of 
using processing-near-memory include how to allocate data to memory since the access patterns 
for high-density memories are often limited (e.g., data needs to be divided into different banks 
and vaults in the DRAM or stacked DRAM, respectively), how to design the network-on-chip
between the memory and PEs, how to allocate the chip area between on-chip memory
and compute now that off-chip communication is less expensive, and how to design the memory
hierarchy and dataflow now that the data movement costs are different.


10.1.1 EMBEDDED HIGH-DENSITY MEMORIES 


Accessing data from off-chip memory can result in high energy cost as well as limited memory
bandwidth: the data bus width is limited by the number of I/O pads, and the signaling
frequency is limited by the channel characteristics of the off-chip routing. Therefore, there has
been a significant amount of effort toward embedding high-density memory on-chip. This
includes technology such as embedded DRAM (eDRAM) [320] as well as embedded non-volatile
memory (eNVM) [321], which includes embedded Flash (eFlash) [322], magnetic random-access
memory (MRAM) [323], resistive random-access memory (RRAM) [324, 325], and phase change
memory (PCRAM) [326].

In DNN processing, these high-density memories can be used to store tens of megabytes
of weights and activations on chip to reduce off-chip access. For instance, DaDianNao [152]
uses 16 × 2 MB eDRAM tiles to store the weights and 2 × 2 MB eDRAM tiles to store the input
and output activations; furthermore, all these tiles (each with 4096-bit rows) can be accessed
in parallel, which gives extremely high memory bandwidth.² The downside of eDRAM is that
it has a lower density than off-chip DRAM and can increase the fabrication cost of the chip.
In addition, it has been reported that eDRAM scaling is slower than SRAM scaling [327],
and thus the density advantage of eDRAM over SRAM will reduce over time. In contrast,
eNVMs have gained popularity in recent years due to their increased density as well as their
non-volatility and reduction in standby power (e.g., leakage, refresh, etc.) compared to
eDRAM [327].


10.1.2 STACKED MEMORY (3-D MEMORY) 


Rather than integrating DRAM into the chip itself, the DRAM can also be stacked on top of
the chip using through-silicon vias (TSVs). This technology is often referred to as 3-D memory,³
and has been commercialized in the form of Hybrid Memory Cube (HMC) [328] and High
Bandwidth Memory (HBM) [122]. 3-D memory delivers an order of magnitude higher bandwidth
and reduces access energy by up to 5x relative to existing 2-D DRAMs, as TSVs have
lower capacitance than typical off-chip interconnects.

Recent works have explored the use of HMC for efficient DNN processing in a variety of
ways. For instance, Neurocube [316], shown in Figure 10.1a, uses HMC to bring the memory
and computation closer together. Each DRAM vault (vertically stacked DRAM banks) is connected
to a PE containing a buffer and several MACs. A 2-D mesh network-on-chip (NoC) is
used to connect the different PEs, allowing the PEs to access data from different vaults. One
major design decision involves determining how to distribute the weights and activations across
the different vaults to reduce the traffic on the NoC.


² DaDianNao [152] assumes that the DNN model can fit into the 32MB of eDRAM allocated to the weights. In practice,
this implies that the design either limits the size of the DNN model, or requires access to off-chip memory if the size of the DNN
model exceeds the capacity of the eDRAM.

³ Also referred to as “in-package” memory since both the memory and compute can be integrated into the same package.


[Figure 10.1 panels: (a) Neurocube (figure from [313]); (b) Quest (figure from [326]). Labels in the
figure: Memory Vault, Inductive Coupling (TCI) Channel, Logic Die, Processing Engine.]

Figure 10.1: Stacked memory systems. (a) DRAM using through-silicon vias (TSV) and
(b) SRAM using inductive coupling.

Another example that uses HMC is Tetris [317], which explores the use of HMC with 
the Eyeriss spatial architecture and row-stationary dataflow. It proposes allocating more area to 
computation than on-chip memory (i.e., larger PE array and smaller global buffer) in order to 
exploit the low-energy and high-throughput properties of the HMC. It also adapts the dataflow 
to account for the HMC and smaller on-chip memory. 

SRAM can also be stacked on top of the chip to provide 10x lower latency compared 
to DRAM [318]. For instance, Quest [318], shown in Figure 10.1b, uses eight 3-D stacked 
SRAM dies to store both the weights and the activations of the intermediate feature maps when 
processing layer by layer. The SRAM dies are connected to the chip using inductive-coupling 
die-to-die wireless communication technology, known as a ThruChip Interface (TCI) [330], 
which has lower integration cost than TSV. 

The above 3-D memory designs involve using TSV or TCI to connect memory and logic
dies that have been separately fabricated. Recent breakthroughs in nanotechnology have made it
feasible to directly fabricate thin layers of logic and memory devices on top of each other, referred
to as monolithic 3-D integration. Interlayer vias (ILVs), which have several orders of magnitude
denser vertical connectivity than TSV, can then be used to connect the memory and compute.
Current monolithic 3-D integration systems, such as N3XT, use on-chip non-volatile memory
(e.g., resistive RAM (RRAM), spin-transfer torque RAM (STT-RAM)/magnetic RAM
(MRAM), and phase change RAM (PCRAM)) and carbon nanotube logic (CNFET). Based on
simulations, the energy-delay product of systems using ILVs can be up to two orders of magnitude
lower than that of 2-D systems on deep neural network workloads, compared to 8x lower for
TSV [319].⁴


Figure 10.2: Comparison of conventional processing and processing in memory.

In order to fully understand the impact of processing near memory, it is important to
analyze the impact that the added storage layer has on the mappings that are now available.
Specifically, the new memories are faster, but also smaller, so optimal mappings will be different.


10.2 PROCESSING IN MEMORY 


While the previous section discussed methods to bring the compute near the memory, this
section discusses processing in memory, which brings the compute into the memory. We will first
highlight the differences between processing in memory and conventional architectures, then
describe how processing in memory can be performed using different memory technologies
including NVM, SRAM, and DRAM. Finally, we will highlight various design challenges
associated with processing-in-memory accelerators that are commonly found across technologies.

DNN processing can be performed using matrix-vector multiplication (see Figures 4.2
and 4.3), as discussed in Chapter 4. For conventional architectures, both the input activation
vector and the weight matrix are read out from their respective memories and processed by a
MAC array, as shown in Figure 10.2a; the number of weights that can be read at once is limited
by the memory interface (e.g., the read out logic and the number of memory ports). This limited
memory bandwidth for the weights (e.g., a row of A weights per cycle in Figure 10.2a) can
also limit the number of MAC operations that can be performed in parallel (i.e., operations per
cycle) and thus the overall throughput (i.e., operations per second).


⁴ The savings are highest for DNN models and configurations with low amounts of data reuse (e.g., FC layers with small
batch size) resulting in more data movement across ILV.

Processing-in-memory architectures propose moving the compute into the memory that
stores the weight matrix, as shown in Figure 10.2b. This can help reduce the data movement of
the weights by avoiding the cost of reading the weight matrix; rather than reading the weights,
only the computed results such as the partial sums or the final output activations are read out
of the memory. Furthermore, processing-in-memory architectures can also increase the memory
bandwidth, as the number of weights that can be accessed in parallel is no longer limited by
the memory interface; in theory, the entire weight matrix (e.g., A x B in Figure 10.2b) could be
read and processed in parallel.

Figure 10.3 shows a weight-stationary dataflow architecture that is typically used for pro- 
cessing in memory. The word lines (WLs) are used to deliver the input activations to the storage 
elements, and the bit lines (BLs) are used to read the computed output activations or partial 
sums. The MAC array is implemented using the storage elements (that store the weights), where 
a multiplication is performed at each storage element, and the accumulation across multiple stor- 
age elements on the same column is performed using the bit line. In theory, a MAC array of 
B rows of A elements can access all A x B weights at the same time, and perform up to A dot 
products in parallel, where each sums B elements (i.e., A x B MAC operations per cycle). 
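
As a rough illustration of this dataflow, the following NumPy sketch models an idealized array
with no DACs, ADCs, or device non-idealities. The function name pim_array_mac and the choice
to store the weights as a (B, A) array (B word lines by A bit lines) are assumptions made for
this example and are not taken from any particular design.

```python
import numpy as np

def pim_array_mac(weights, inputs):
    """Idealized single-step MAC array (hypothetical helper).

    weights: shape (B, A), one stored weight per storage element
             (B word lines / rows, A bit lines / columns).
    inputs:  shape (B,), one input activation driven on each word line.
    Returns the A partial sums read out on the bit lines.
    """
    weights = np.asarray(weights)
    inputs = np.asarray(inputs)
    B, A = weights.shape
    assert inputs.shape == (B,)
    # A multiplication is performed at every storage element ...
    products = inputs[:, None] * weights      # shape (B, A): A x B MACs
    # ... and each bit line accumulates the B products in its column.
    return products.sum(axis=0)               # shape (A,)

# Small example: B = 3 word lines, A = 2 bit lines.
w = np.arange(6).reshape(3, 2)
x = np.array([1, 2, 3])
psums = pim_array_mac(w, x)
```

Each input activation is broadcast across all A columns, which is the reuse that the
weight-stationary dataflow exploits.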

Similar to other weight-stationary architectures, the input activations can be reused across
the different columns (up to A times for the example given in Figure 10.3), which reduces the
number of input activation reads. In addition, since a storage element tends to be smaller in area
than the logic for a digital MAC (10 to 100x smaller in area and 3 to 10x smaller in edge
length [331]), the routing capacitance to deliver the input activations can also be reduced, which
further reduces the energy cost of delivering the input activations. Depending on the format of
the inputs and outputs to the array, digital-to-analog converters (DACs) and analog-to-digital
converters (ADCs) may also be required to convert the word line and bit line values, respectively;
the cost of the DAC scales with the precision of the input activations driven on the word line,
while the cost of the ADC scales with the precision of the partial sums, which depends on the
precision of the weights and input activations, and the number of values accumulated on the bit
line (up to B).⁵

⁵ The number of bits that an ADC can correctly resolve also depends on its thermal noise (typically some multiple of
kT/C, where k is the Boltzmann constant, T is the temperature, and C is the capacitance of the sampling capacitor). For
instance, an N-bit ADC has $2^N - 1$ decision boundaries (see Section 7.2.1). However, if the thermal noise is large, the location
of the $2^N - 1$ decision boundaries will move around, dynamically and randomly, and this will affect the resulting accuracy of
the DNN being processed. Therefore, designing a low noise ADC is an important consideration. Note that the thermal noise
of the ADC scales with the power consumption and the area of the ADC. Accordingly, it is important that the ADC’s thermal
noise be considered when evaluating the accuracy as demonstrated in [332-334], as the design of the ADC involves a trade-off
between power, area, and accuracy.


Figure 10.3: Typical dataflow for processing-in-memory accelerators.


An alternative way to view processing in memory is to use the loop nest representation
introduced in Chapter 5. Design 10.20 illustrates a processing-in-memory design for an FC layer
with M output channels, where the input activations are flattened along the input channel,
height, and width dimensions (CHW). The computation takes place in a single cycle, with all
the results computed in line 7. For this design, some of the mapping constraints are that
A ≥ M and B ≥ C x H x W.⁶ Note that when A ≠ M or B ≠ C x H x W, under-utilization
will occur, as described in Section 10.2.4.
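
The FC mapping and its constraints can be sketched in NumPy as follows. This is a minimal
sketch under stated assumptions: the helper name fc_on_pim, the placement of weight f[m, chw]
at row chw and column m, and the definition of utilization as the fraction of storage elements
holding a weight are choices made for this example rather than details given in the text.

```python
import numpy as np

def fc_on_pim(i, f, A, B):
    """Map an FC layer onto an array of B rows by A columns (hypothetical helper).

    i: input activations, shape (CHW,)
    f: filter weights, shape (M, CHW)
    Returns (output partial sums of shape (M,), array utilization).
    """
    i = np.asarray(i)
    f = np.asarray(f)
    M, CHW = f.shape
    if A < M or B < CHW:
        # Disallowed here: it would need multiple passes and weight updates.
        raise ValueError("weights do not fit: need A >= M and B >= CHW")
    array = np.zeros((B, A), dtype=f.dtype)
    array[:CHW, :M] = f.T            # weight f[m, chw] at row chw, column m
    wordlines = np.zeros(B, dtype=i.dtype)
    wordlines[:CHW] = i              # unused word lines are driven with zero
    bitlines = wordlines @ array     # all M x CHW MACs in a single step
    utilization = (M * CHW) / (A * B)
    return bitlines[:M], utilization

acts = np.array([1, 2, 3])                   # CHW = 3
wts = np.array([[1, 0, 1], [2, 1, 0]])       # M = 2
o_big, util_big = fc_on_pim(acts, wts, A=4, B=4)   # array larger than the layer
o_fit, util_fit = fc_on_pim(acts, wts, A=2, B=3)   # exact fit
```

With the oversized 4 x 4 array only 6 of the 16 storage elements hold weights, which is the
under-utilization referred to above; the exact-fit array is fully used.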

⁶ For this example, we disallow the cases where A < M or B < C x H x W, since that would require multiple passes
and updates of the weights, which reduces the potential benefits of processing in memory.


Design 10.20 FC layer for Processing in Memory

1  i = Array(CHW)      # Input activations
2  f = Array(M, CHW)   # Filter weights
3  o = Array(M)        # Output partial sums
4
5  parallel-for m in [0, M):
6      parallel-for chw in [0, CHW):
7          o[m] += i[chw] * f[m, chw]


A processing-in-memory design can also handle convolutions, as illustrated in the loop
nest in Design 10.21. Here, we show a toy design of just a 1-D convolution with multiple
input channels (C) and multiple output channels (M). The entire computation takes Q steps,
as the only temporal step is the for loop (line 8). Interpreting the activity in the body of the
loop (line 10), we see that in each cycle all filter weights are used (M x S x C), each as part of a
distinct MAC operation; C x S input activations are each used multiple times (once per output
channel); and M output partial sums are accumulated into. This design reflects the Toeplitz
expansion of the input activations (see Section 4.1), so the same input activations will be
delivered multiple times, because the same value of the input activation index w is generated for
different values of q. For the processing-in-memory convolution design, some of the mapping
constraints are that A ≥ M and B ≥ C x S. Note that when A ≠ M or B ≠ C x S,
under-utilization will occur, as described in Section 10.2.4.


Design 10.21 1-D Weight-Stationary Convolution Dataflow for Processing in Memory

 1  i = Array(C, W)      # Input activations
 2  f = Array(M, C, S)   # Filter weights
 3  o = Array(M, Q)      # Output partial sums
 4
 5  parallel-for m in [0, M):
 6      parallel-for s in [0, S):
 7          parallel-for c in [0, C):
 8              for q in [0, Q):
 9                  w = q + s
10                  o[m, q] += i[c, w] * f[m, c, s]
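
Design 10.21 can be cross-checked with a small runnable sketch. The version below keeps the
weights stationary in a (C x S)-row by M-column array and delivers one column of the Toeplitz
expansion per step. The function name conv1d_on_pim, the flattening order of the C and S
dimensions, and the relation Q = W - S + 1 (unit stride, no padding) are assumptions made for
this example.

```python
import numpy as np

def conv1d_on_pim(i, f):
    """1-D weight-stationary convolution (hypothetical helper).

    i: input activations, shape (C, W)
    f: filter weights, shape (M, C, S)
    Returns output partial sums o of shape (M, Q).
    """
    i = np.asarray(i)
    f = np.asarray(f)
    C, W = i.shape
    M, C_f, S = f.shape
    assert C == C_f
    Q = W - S + 1                          # assumed: unit stride, no padding
    array = f.reshape(M, C * S).T          # (C*S rows, M columns), never rewritten
    o = np.zeros((M, Q), dtype=np.result_type(i, f))
    for q in range(Q):                     # the only temporal loop (line 8)
        # Row c*S + s of the array receives i[c, q + s], i.e., w = q + s.
        wordlines = i[:, q:q + S].reshape(C * S)
        o[:, q] = wordlines @ array        # M x C x S MACs in this step
    return o

acts = np.arange(8).reshape(2, 4)          # C = 2, W = 4
wts = np.ones((3, 2, 2), dtype=int)        # M = 3, C = 2, S = 2
out = conv1d_on_pim(acts, wts)             # shape (3, 3), so Q = 3
```

Consecutive steps overlap in S - 1 positions along W, so most input activations are delivered
to the word lines more than once, as noted in the text.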

In the next few sections (Sections 10.2.1, 10.2.2, and 10.2.3), we will briefly describe how
processing in memory can be performed using different memory technologies. Section 10.2.4
will then give an overview of some of the key design challenges and decisions that should be
considered when designing processing-in-memory accelerators for DNNs. For instance, many
of these designs are limited to reduced precision (i.e., low bit-width) due to the non-idealities
of the devices and circuits used for memories.


10.2.1 NON-VOLATILE MEMORIES (NVM) 


Many recent works have explored enabling processing in memory using non-volatile memories
(NVM) due to their high density and thus potential for replacing off-chip memory and reducing
off-chip data movement. Advanced non-volatile high-density memories use programmable
resistive elements, commonly referred to as memristors [335], as storage elements. These NVM
devices enable increased density since memory and computation can be densely packed with a
similar density to DRAM [336].⁷

Non-volatile memories exploit Ohm's law by using the conductance (i.e., the inverse of the
resistance) of a device to represent a filter weight and the voltage across the device to represent
the input activation value, so that the resulting current can be interpreted as the product (i.e., a
partial sum). This is referred to as a current-based approach. For instance, Figure 10.4a shows
how a multiplication can be performed using the conductance of the NVM device as the weight,
the voltage on the word line as the input activation, and the current output to the bit line
as the product of the two. The accumulation is done by summing the currents on the bit line
based on Kirchhoff’s current law. Alternatively, for Flash-based NVM, the multiplication is
performed using the current-voltage (I-V) characteristics of the floating-gate transistor, where
the threshold voltage of the floating-gate transistor is set based on the weight, as shown in
Figure 10.4c. Similar to the previously described approach, a voltage proportional to the input
activation can be applied across the device, and the accumulation is performed by summing
the output currents of the devices on the bit line.

NVM-based processing-in-memory accelerators have several unique challenges, as described
in [340, 341]. First, the cost of programming the memristors (i.e., writing to non-volatile
memory) can be much higher than SRAM or DRAM; thus, typical solutions in this space
require the non-volatile memory to be sufficiently large to hold all weights of the DNN
model, rather than changing the weights in the memory for each layer or filter during
processing.⁸ As discussed in Chapter 3, this may reduce flexibility as it can limit the size of the DNN
model that the accelerator can support.

⁷ To improve density, the resistive devices can be inserted at the cross-point of two wires and in certain cases can
avoid the need for an access transistor [337]. Under this scenario, the device is commonly referred to as a cross-point element.

⁸ This design choice to hold all weights of the DNN is similar to the approach taken in some of the FPGA designs such
as Brainwave [209] and FINN [226], where the weights are pinned on the on-chip memory of the FPGA during synthesis.


Figure 10.4: Performing a multiplication and accumulation using the storage element. Input
activation is encoded as a voltage amplitude (V_i). (a) For memristors, G_i is the conductance
(i.e., 1/resistance) of a resistive device set according to the weight, and bit line current I is
the accumulated partial sum value [329]. (b) The current-voltage (I-V) characteristics of the
resistive device. The slope of the curve is inversely proportional to the resistance (recall R =
V/I). Typically, the device can take on just two states: LRS is the low resistive state (also referred
to as R_ON) and HRS is the high resistive state (also referred to as R_OFF). (c) and (d) For
floating-gate transistors, the multiplication is performed using its current-voltage (I-V) characteristics,
where the weight sets the threshold voltage (as illustrated by the different color lines representing
different threshold voltages), and bit line current I is the accumulated partial sum value [339].


Second, the NVM devices can also suffer from device-to-device and cycle-to-cycle
variations with nonlinear conductance across the conductance range [340-342]. This affects the
number of bits that can be stored per device (typically 1 to 4) and the type of signaling used
for the input and output activations. For instance, rather than encoding the input activation in
terms of voltage amplitude, the input can also be encoded in time using pulse width modulation
with a fixed voltage (i.e., a unary coding), and the resulting current can be accumulated over time
on a capacitor to generate the output voltage [343].
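
To make the time-encoded scheme concrete, the sketch below computes the output voltage of one
bit line under idealized assumptions: each input is a pulse of fixed amplitude whose duration
is proportional to the activation, each device contributes a current G_i * V while its pulse
is high (Ohm's law), and an ideal capacitor integrates the total charge. The function name
pwm_column_voltage and all parameter names and values are hypothetical and chosen only for
illustration.

```python
import numpy as np

def pwm_column_voltage(conductances, activations, v_read, t_unit, c_int):
    """Output voltage of one bit line with pulse-width-encoded inputs.

    conductances: device conductances G_i in siemens (the weights)
    activations:  non-negative integer input codes; input i is a pulse of
                  amplitude v_read lasting activations[i] * t_unit seconds
    c_int:        integrating capacitance in farads (assumed ideal)
    """
    g = np.asarray(conductances, dtype=float)
    t = np.asarray(activations, dtype=float) * t_unit
    charge = np.sum(g * v_read * t)   # Q = sum_i G_i * V * t_i
    return charge / c_int             # V_out = Q / C

v_out = pwm_column_voltage([1e-6, 2e-6], [3, 1],
                           v_read=0.2, t_unit=1e-9, c_int=1e-12)
```

Because the voltage amplitude is fixed, the result is proportional to the dot product of the
conductances and the pulse durations; the device is only ever operated at one voltage, which
sidesteps the nonlinearity of the conductance with respect to the applied amplitude.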

Finally, the NVM devices cannot have negative resistance, which presents a challenge
for supporting negative weights. One approach is to represent signed weights using differential
signaling that requires two storage elements per weight; accordingly, the weights are often
stored using two separate arrays [344]. Another approach is to avoid using signed weights. For
instance, in the case of binary weights, rather than representing the weights as [−1, 1] and
performing binary multiplications, the weights can be represented as [0, 1] and processed with XNOR
logic operations, as discussed in Chapter 7, or NAND logic operations, as discussed in [345].
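
The differential approach can be sketched as follows. The particular encoding used here (the
positive part of each weight in one array, the magnitude of the negative part in the other,
and a subtraction of the two bit-line currents) is one common choice and is an assumption of
this example, as are the names differential_crossbar and g_unit.

```python
import numpy as np

def differential_crossbar(weights, v_in, g_unit=1.0):
    """Signed matrix-vector product using two non-negative conductance arrays.

    weights: signed weights, shape (B, A)
    v_in:    word line voltages, shape (B,)
    g_unit:  conductance that represents a weight magnitude of 1 (assumed)
    Returns the signed bit-line result for each of the A columns.
    """
    w = np.asarray(weights, dtype=float)
    v = np.asarray(v_in, dtype=float)
    g_pos = np.maximum(w, 0.0) * g_unit    # array holding the positive parts
    g_neg = np.maximum(-w, 0.0) * g_unit   # array holding the negative parts
    # Each array only ever stores non-negative conductances.
    assert (g_pos >= 0).all() and (g_neg >= 0).all()
    i_pos = v @ g_pos    # Ohm's law per device, Kirchhoff sum per bit line
    i_neg = v @ g_neg
    return (i_pos - i_neg) / g_unit

signed_w = np.array([[1, -2], [-3, 4]])
volts = np.array([1.0, 2.0])
result = differential_crossbar(signed_w, volts)
```

The cost of this scheme is visible in the code: two storage elements are needed for every
weight, which halves the effective density.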

There are several popular candidates for NVM devices including phase change RAM
(PCRAM), resistive RAM (RRAM or ReRAM), conductive bridge RAM (CBRAM), and
spin transfer torque magnetic RAM (STT-MRAM) [346]. These devices have different
trade-offs in terms of endurance (i.e., how many times a device can be written), retention time (i.e.,
how often it needs to be refreshed and thus how frequently it needs to be written), write current
(i.e., how much power is required to perform a write), area density (i.e., cell size), variations, and
speed. An in-depth discussion of how these device properties affect the performance of DNN
processing can be found in [341]; Gokmen et al. [343] flip the problem and describe how these
devices should be designed such that they can be better suited for DNN processing.⁹

Recent works on NVM-based processing-in-memory accelerators have reported results 
from both simulation [329, 338, 347, 348] as well as fabricated test chips [344, 349]. While 
works based on simulation demonstrate functionality on large DNN models such as variants of 
VGGNet [73] for image classification on ImageNet, works based on fabricated test chips still 
demonstrate functionality on simple DNN models for digit classification on MNIST [344, 349]. 
Simulations often project capabilities beyond the current state-of-the-art. For instance, while 
works based on simulation often assume that all 128 or 256 rows can be activated at the same 
time, works based on fabricated test chips only activate up to 32 rows at once to account for 
process variations and limitations in the read out circuits (e.g., ADC); these limitations will be 
discussed more in Section 10.2.4. It should also be noted that fabricated test chips typically only 
use one bit per memristor [344, 349, 350]. 


10.2.2 STATIC RANDOM ACCESS MEMORIES (SRAM) 


Many recent works have explored the use of the SRAM bit cell to perform computation. They 
can be loosely classified into current-based and charge-based designs. 

⁹ [341, 343] also describe how these devices might be used for training DNNs if the weights can be updated in parallel
and in place within the memristor array.

Current-based designs use the current-voltage (I-V) characteristics of the bit cell to perform
a multiplication, which is similar to the NVM current-based approach described in
Section 10.2.1. For instance, Figure 10.5a shows how the input activation can be encoded as a
voltage amplitude on the word line that controls the current through the pull-down network of
a bit cell (I_BC), resulting in a voltage drop (V_BL) proportional to the word line voltage [351]. The
currents from multiple bit cells (across different rows on the same column) add together on the
bit line to perform the accumulation [351]. The resulting voltage drop on the bit line is then
proportional to the dot product of the weights and activations of the column.

The above current-based approach is susceptible to the variability and nonlinearity of the
word line voltage-to-current relationship of the pull-down network in the bit cell; this creates
challenges in representing the weights precisely. Charge-based approaches avoid this by using
charge sharing for the multiplication, where the computation is based on the capacitance ratio
between capacitors, which tends to be more linear and less sensitive to variations.

Figure 10.5b shows how a binary multiplication (i.e., XNOR) via charge sharing can be 
performed by conditionally charging up a local capacitor within a bit cell, based on the XNOR 
between the weight value stored in the bit cell and the input activation value that determines 
the word line voltage [352]. Accumulation can then be performed using charge sharing across 
the local capacitors of the bit cells on a bit line [352]. Other variants of this approach include 
performing the multiplication directly with the bit line [353], and charge sharing across different 
bit lines to perform the accumulation [353-355]. 
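
The arithmetic behind the XNOR-based scheme can be sketched as follows. Bits 0 and 1 stand for
the values −1 and +1; a cell's local capacitor is charged when the stored weight bit and the
input bit agree. The sketch assumes identical, ideal capacitors, so that charge sharing yields
a voltage proportional to the fraction of charged capacitors; the function name xnor_dot is
hypothetical.

```python
import numpy as np

def xnor_dot(w_bits, x_bits):
    """Binary dot product via XNOR and counting (hypothetical helper).

    w_bits, x_bits: sequences of 0/1, where 0 encodes -1 and 1 encodes +1.
    Returns (signed dot product, fraction of local capacitors charged).
    """
    w_bits = np.asarray(w_bits, dtype=np.uint8)
    x_bits = np.asarray(x_bits, dtype=np.uint8)
    xnor = 1 - (w_bits ^ x_bits)   # 1 where the bits agree (product is +1)
    ones = int(xnor.sum())         # number of charged local capacitors
    n = int(xnor.size)
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * ones - n, ones / n

dot, charged_fraction = xnor_dot([1, 0, 1, 1], [1, 1, 0, 1])
```

The signed result is recovered from the count of agreeing cells alone, which is why only the
shared voltage on the bit line needs to be digitized.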

One particular challenge that exists for SRAM-based processing-in-memory accelerators 
is maintaining bit cell stability. Specifically, the voltage swing on the bit line typically needs to 
be kept low in order to avoid a read disturb (i.e., accidentally flipping the value stored in the 
bit cell when reading). This limits the voltage swing on the bit line, which affects the number 
of bits that can be accumulated on the bit line for the partial sum; conventional SRAMs only 
resolve one bit on the bit line. One way to address this is by adding extra transistors to isolate 
the storage node in the bit cell from the large swing of the bit line [353]; however, this would 
increase the bit cell area and consequently reduce the overall area density. 

Recent works on SRAM-based processing-in-memory accelerators have reported results
from fabricated test chips [351-355]. These works demonstrate functionality on simple
DNN models for digit classification on MNIST, often using layer-by-layer processing, where
the weights are updated in the SRAM for each layer. Note that in these works, the layer shapes
of the DNN model are often custom designed to fit the array size of the SRAM to increase
utilization; this may pose a challenge in terms of flexibility, as discussed in Chapter 3.¹⁰


10.2.3 DYNAMIC RANDOM ACCESS MEMORIES (DRAM) 


¹⁰ It should be noted that since SRAM is less dense than typical off-chip memory (e.g., DRAM), these designs are not
intended to replace off-chip memory or to specifically address the “memory wall,” which pertains to off-chip memory bandwidth;
instead, most SRAM-based processing-in-memory accelerators focus on reducing the on-chip data movement.


[Figure 10.5 panels: (a) Multiplication using a 6T SRAM bit-cell and accumulation by current summing on
bit lines (figure adapted from [348]); (b) Multiplication using an 8T SRAM bit-cell and a local capacitor and
accumulation using charge sharing across local capacitors (figure adapted from [349]).]

Figure 10.5: Performing a multiplication and accumulation using the storage element. (a)
Multiplication can be performed using an SRAM bit-cell by encoding the input activation as a voltage
amplitude on the word line that controls the current through the pull-down network of the bit
cell (I_BC), resulting in a voltage drop (V_BL) proportional to the word line voltage. If a zero (weight
value of −1) is stored in the bit cell, the voltage drop occurs on BL, while if a one (weight value
of +1) is stored the voltage drop occurs on BLB. The currents from multiple bit-cells within a
column add together. (b) Binary multiplication (XNOR) is performed using connection
transistors and a local capacitor. Accumulation is performed by charge sharing across local capacitors in
bit-cells from the same column.

Figure 10.6: Compute in DRAM based on charge sharing. Z controls whether an AND or OR is
performed on input X and Y. At time t = 0, the local capacitors of the bit cells for X, Y, and Z are
charged to V_DD for one and 0 for zero, and the bit line is pre-charged to V_DD/2. At time t = 1, the
access transistors of the bit cells are enabled, and the capacitors are shorted together with the
bit line. Charge sharing distributes the charge between the capacitors to ensure that the voltage
across each capacitor is the same; therefore the resulting voltage on the bit line is proportional to
the average charge across the three capacitors. If the majority of the capacitors store a one (i.e.,
V_DD), then the voltage on the bit line rises above V_DD/2 (i.e., +δ); otherwise, the voltage on
the bit line drops below V_DD/2 (i.e., −δ). At time t = 2, the sense amplifiers (SA) on the bit line
amplify the voltage to full swing (i.e., V_DD/2 + δ becomes V_DD or V_DD/2 − δ becomes 0), such
that the output of the logic function XY + YZ + ZX can be resolved on the bit line. Note that
this form of computing is destructive, so we need to copy data beforehand.


Recent works have explored how processing in memory may be feasible using DRAM by
performing bit-wise logic operations when reading multiple bit cells. For instance, Figure 10.6
shows how AND and OR operations can be performed by accessing three rows in parallel [356].
When three rows are accessed at the same time, the output bit line voltage will depend on the
average of the charge stored in the capacitors of the bit cells in three rows (note that the charge
stored in the capacitor of a bit cell depends on whether the bit cell is storing a one or zero). Therefore,
if the majority of the values of the bit cells are one (at least two out of three), then the output
is a one; otherwise, the output is a zero. More precisely, if X, Y, and Z represent the logical
values of the three cells, then the final state of the bit line is XY + YZ + ZX. If Z = 1, then
this is effectively an OR operation between X and Y; if Z = 0, then this is effectively an AND
operation between X and Y. The bit-wise logic operations can be built up into MAC operations
across multiple cycles [357], similar to bit-serial processing described in Chapter 7.
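
The majority function and its use as AND and OR can be checked with a few lines of Python.
Rows are modeled as Python integers used as bit vectors, with one bit per bit line. The
charge-sharing model in sensed_bit assumes three identical cell capacitors and ignores the
bit line's own capacitance, which would pull the shared voltage toward V_DD/2 without changing
the side of V_DD/2 on which it lands; all function names here are hypothetical.

```python
def majority(x, y, z):
    """Bit-wise majority XY + YZ + ZX of three rows stored as integers."""
    return (x & y) | (y & z) | (z & x)

def dram_and(x, y):
    return majority(x, y, 0)                  # control row Z = all zeros

def dram_or(x, y, width):
    return majority(x, y, (1 << width) - 1)   # control row Z = all ones

def sensed_bit(x, y, z, vdd=1.0):
    """One bit line: ideal charge sharing followed by a sense amplifier."""
    shared = vdd * (x + y + z) / 3.0          # average of the three cell voltages
    return 1 if shared > vdd / 2 else 0       # amplified to full swing

row_x, row_y = 0b1100, 0b1010
and_result = dram_and(row_x, row_y)
or_result = dram_or(row_x, row_y, width=4)
```

One row activation therefore performs one logic operation on every bit line at once, which is
where the parallelism of this approach comes from.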

It is important to note that the architecture of processing in memory with DRAM differs
from the processing in memory with NVM and SRAM (described in Sections 10.2.1 and 10.2.2,
respectively) in that: (1) for DRAM, a bit-wise operation requires three storage elements from
different rows, whereas for NVM and SRAM, a MAC operation can be performed with a single
storage element; and (2) for DRAM, only one bit-wise operation is performed per bit line and
the accumulation occurs over time, whereas for NVM and SRAM, the accumulation of multiple
MAC operations is performed on the bit line.¹¹ As a result, for DRAM the parallel processing
can only be enabled across bit lines (A in Figure 10.3), since only one operation can be performed
per bit line, whereas for NVM and SRAM, the parallel processing can be enabled across both the
bit lines and the word lines (A and B in Figure 10.3), since multiple operations can be performed
per bit line. In addition, for DRAM, multiple cycles are required to build up a MAC operation
from a bit-wise logic operation, which reduces throughput. Thus, a challenge for DRAM-based
processing-in-memory accelerators is to ensure that there is sufficient parallelism across bit lines
(A) to achieve the desired improvements in throughput.

Other challenges for DRAM-based processing-in-memory accelerators include variations
in the capacitance of the different bit cells, changes in the charge stored in the capacitor of a bit
cell over time due to leakage, and detecting small changes in the bit line voltage. Furthermore,
additional hardware may be required in the memory controller to access multiple rows at once
and/or to convert the bit-wise logic operations to MAC operations, all of which can contribute
to energy and area overhead.

While many of the recent works on DRAM-based processing-in-memory accelerators
have been based on simulation [356, 357], it should be noted that performing AND and OR
operations has been demonstrated on off-the-shelf, unmodified, commercial DRAM [359].
This was achieved by violating the nominal timing specification and activating multiple rows in
rapid succession, which leaves multiple rows open simultaneously and enables charge sharing on
the bit line.


10.2.4 DESIGN CHALLENGES 


¹¹ This bit-wise (bit-serial) approach has also been explored for SRAM [358].

Processing-in-memory accelerators offer many potential benefits including reduced data
movement of weights, higher memory bandwidth by reading multiple weights in parallel, higher
throughput by performing multiple computations in parallel, and lower input activation delivery
cost due to increased density of compute. However, there are several key design challenges
and decisions that need to be considered in practice. Analog processing is typically required to
bring the computation into the array of storage elements or into its peripheral circuits; therefore
the major challenges for processing in memory are its sensitivity to circuit and device
non-idealities (i.e., nonlinearity and process, voltage, and temperature variations).¹² Solutions to these
challenges often require trade-offs between energy efficiency, throughput, area density, and
accuracy,¹³ which reduce the achievable gains over conventional architectures. Architecture-level
energy and area estimation tools such as Accelergy can be used to help evaluate some of these
trade-offs [360].


Design 10.22 Toy matrix multiply loop nest

1  i = Array(CHW)      # Input activations
2  f = Array(CHW, M)   # Filter weights
3  o = Array(M)        # Output partial sums
4
5  parallel-for m in [0, M):
6      parallel-for chw in [0, CHW):
7          o[m] += i[chw] * f[chw, m]

In this section, when applicable, we will use a toy example of a matrix-vector multiplication
based on an FC layer, shown in Figure 10.7. A loop-nest representation of the design is shown
in Design 10.22, where CHW = M = 4. In theory, the entire computation should only require
one cycle as all the 16 weights can be accessed in parallel and all the 16 MAC operations can be
performed in parallel.


Number of Storage Elements per Weight 

Ideally, it would be desirable to be able to use one storage element (i.e., one device or bit cell) 
per weight to maximize density. In practice, multiple storage elements are required per weight 
due to the limited precision of each device or bit cell (typically on the order of 1 to 4 bits). As a 
result, multiple low-precision storage elements are used to represent a higher precision weight. 
Figure 10.8 shows how this applies to our toy example. 
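The storage cost in Figure 10.8 can be illustrated with a short sketch that splits each unsigned weight into low-precision cells and rebuilds it by shift-and-add. The function names and the weight values are hypothetical; the only figures taken from the text are 2 bits per weight and the resulting growth from 4 × 4 to 4 × 8 storage elements, which corresponds to 1-bit cells.

```python
def slice_weight(w, weight_bits, cell_bits=1):
    """Split an unsigned weight into cells of cell_bits each, least significant first."""
    n_cells = -(-weight_bits // cell_bits)          # ceiling division
    mask = (1 << cell_bits) - 1
    return [(w >> (k * cell_bits)) & mask for k in range(n_cells)]


def join_cells(cells, cell_bits=1):
    """Rebuild the weight from its cells by shift-and-add."""
    return sum(c << (k * cell_bits) for k, c in enumerate(cells))


# Hypothetical 4 x 4 matrix of 2-bit weights.
weights = [[1, 0, 3, 2],
           [0, 2, 2, 0],
           [3, 1, 0, 1],
           [2, 3, 1, 1]]

sliced = [[slice_weight(w, weight_bits=2) for w in row] for row in weights]
cells_per_row = sum(len(cells) for cells in sliced[0])

print(len(sliced), "x", cells_per_row)              # 4 x 8
```

Using wider cells (for example `cell_bits=2`) would bring the count back to one cell per weight, which is the density the ideal case asks for.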


12 Note that per-chip training (i.e., different DNN weights per chip instance) may help address nonlinearity and
chip-to-chip variability, but is expensive in practice. In addition, while adapting the weights can help address static variability, dynamic
variability, such as a change in temperature, remains a challenge.

13 It should be noted that the loss in accuracy might not only be due to the reduced precision of the computations in the
DNN model (discussed in Chapter 7), which can be replicated on a conventional processor, but also due to circuit/device
non-idealities and limitations, including ADC precision and thermal noise. Unfortunately, these factors have rarely been decoupled
during reporting in literature, which can make it difficult to understand the design trade-offs.
Figure 10.7: Toy example of matrix-vector multiplication for this section. This example uses an
FC layer with N = 1, CHW = 4, and M = 4.

Figure 10.8: Example of multiple storage elements per weight. In our toy example we use 2 bits
per weight so the storage cost goes from 4 × 4 to 4 × 8.








For non-volatile memories (e.g., RRAM), multiple storage elements can also be used per
weight to reduce the effect of device variation (e.g., averaging 3 × 3 devices per weight [342]) or
to represent a signed weight (i.e., since resistance is naturally non-negative, differential coding
using two arrays is often used [342]). Finally, in the case of SRAMs, additional transistors
are often required in the bit cell to perform an operation, which increases the area per bit cell. All of
the above factors reduce the density and/or accuracy of the system.
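The differential coding of signed weights can be sketched as follows. This is one simple mapping, assumed here for illustration: each signed weight is stored as a non-negative pair, one value in a "positive" array and one in a "negative" array, and the two dot products are subtracted. The function names are hypothetical, and the designs in [342] may use a different mapping.

```python
def differential_encode(w):
    """Map a signed weight to a pair of non-negative values (w_pos, w_neg)."""
    return (w, 0) if w >= 0 else (0, -w)


def differential_dot(inputs, signed_weights):
    """Dot product computed as the difference of two non-negative dot products."""
    pairs = [differential_encode(w) for w in signed_weights]
    pos = sum(x * p for x, (p, _) in zip(inputs, pairs))   # "positive" array
    neg = sum(x * n for x, (_, n) in zip(inputs, pairs))   # "negative" array
    return pos - neg


inputs = [1, 2, 0, 3]
signed_weights = [1, -2, 3, -1]                            # hypothetical values
print(differential_dot(inputs, signed_weights))           # -6
```

The cost is visible in the encoding itself: every weight occupies two storage elements, which is the density penalty the paragraph above refers to.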
Array Size

Ideally, it would be desirable to have a large array size (A × B) in order to allow for high weight
read bandwidth and high throughput. In addition, a larger array size improves the area density
by further amortizing the cost of the peripheral circuits, which can be significant (e.g., the
peripheral circuits, i.e., ADC and DAC, can account for over 50% of the energy consumption of
NVM-based designs [329, 349]). In practice, the size of the array is limited by several factors.


1. The resistance and capacitance of word line and bit line wires, which impacts robustness, 
speed, and energy consumption. 


For instance, the bit line capacitance impacts robustness for charge domain approaches
where charge sharing is used for accumulation, as a large bit line capacitance makes it
difficult to sense the charge stored on the local capacitor in the bit cell; the charge stored on
the local capacitor can be an input value for DRAM-based designs or a product of weight
and input activation for SRAM-based designs. An example of using charge sharing to
sense the voltage across a local capacitor is shown in Figure 10.9. Specifically, the change
in bit line voltage ($\Delta V_{BL}$) is

$$\Delta V_{BL} = (V_{DD} - V_{local}) \frac{C_{local}}{C_{local} + C_{BL}}, \tag{10.1}$$

where $C_{local}$ and $C_{BL}$ are the capacitance of the local capacitor and bit line, respectively,
$V_{local}$ is the voltage across the local capacitor (due to the charge stored on the local
capacitor), and $V_{DD}$ is the supply voltage. If the local capacitor is only storing binary values,
then $V_{local}$ can either be $V_{DD}$ or 0. $\Delta V_{BL}$ must be sufficiently large such that we can measure
any change in $V_{local}$; the more bits we want to measure on the bit line (i.e., bits of the
partial sum or output activation), the larger the required $\Delta V_{BL}$. However, the size of $C_{local}$
is limited by the area density of the storage element; for instance, in [352], $C_{local}$ is limited
to 1.2 fF. As a result, the minimum value of $\Delta V_{BL}$ limits the size of $C_{BL}$, which limits the
length of the bit line.


Similarly, the bit line resistance impacts robustness for current domain approaches where
current summing is used for accumulation, as a large bit line resistance makes it difficult
to sense the change in the resistance of the NVM device, as shown in Figure 10.10.
Specifically, the change in bit line voltage due to a change in the resistance is

$$\Delta V_{BL} = V_{HIGH} - V_{LOW} = V_{in} \frac{R_{BL}\,(R_{OFF} - R_{ON})}{(R_{ON} + R_{BL})(R_{OFF} + R_{BL})}, \tag{10.2}$$

where $R_{ON}$ and $R_{OFF}$ are the minimum and maximum resistance of the NVM device
(proportional to the weight), respectively, $R_{BL}$ is the resistance of the bit line, and $V_{in}$ is the
input voltage (proportional to the input activation). The difference $R_{OFF} - R_{ON}$ is limited by the
NVM device [342]. As a result, the minimum value of $\Delta V_{BL}$ limits the size of $R_{BL}$, which
again limits the length of the bit line.
Figure 10.9: Change in bit line voltage $\Delta V_{BL}$ is proportional to $\frac{C_{local}}{C_{local} + C_{BL}}$. The bit line is
precharged to $V_{DD}$ at t = 0, and we read the value on the local capacitor at t = 1.

[Figure 10.10 panels: with WL0 = $V_{in}$, a device in state $R_{ON}$ gives BL = $V_{HIGH}$ and a device in state $R_{OFF}$ gives BL = $V_{LOW}$; each bit line has resistance $R_{BL}$.]

Figure 10.10: Change in bit line voltage $\Delta V_{BL} = V_{HIGH} - V_{LOW}$ is proportional to
$\frac{R_{BL}(R_{OFF} - R_{ON})}{(R_{ON} + R_{BL})(R_{OFF} + R_{BL})}$. $R_{ON}$ (also referred to as LRS) and $R_{OFF}$ (also referred to as
HRS) are the minimum and maximum resistance of the NVM device, respectively.





2. The utilization of the array will drop if the workload cannot fill an entire column or an entire
row, as shown in Figure 10.11a. If the DNN model has few weights per filter and does not
require large dot products, e.g., C × H × W < B, where C, H, and W are the dimensions
of the filter (FC layer), and B is the number of rows in the array, then there will be
B − C × H × W idle rows in the array. If the DNN model has few output channels and does
not have many dot products, e.g., M < A, where M is the number of output channels
and A is the number of columns in the array, then there will be A − M idle columns
in the array.¹⁴ This becomes more of an issue when processing efficient DNN models as
described in Chapter 9, where the trend is to reduce the number of weights per filter. In
digital designs, flexible mapping can be used to increase utilization across different filter
shapes, as discussed in Chapter 6; however, this is much more challenging to implement in
the analog domain. One option is to redesign the DNN model specifically for processing
in memory with larger filters and fewer layers [315], which increases utilization of the
array and reduces input activation data movement; however, the accuracy implications of
such DNN models require further study. Figure 10.11b shows how this applies to our toy
example.


Figure 10.11: Array utilization. (a) Impact of array size on utilization. (b) Example of utilization
if the size of the weight memory were 8 × 8. Even though in theory we should be able to perform 64
MAC operations in parallel, only 16 of the storage elements are used (utilization of 25%); as
a result, only 16 MAC operations are performed in parallel, specifically, 4 dot products of 4
elements.


14 Note that if C × H × W > B or M > A, temporal tiling will need to be applied, as discussed in Chapter 4, and
multiple passes (including updating weights in the array) will be required to complete the MAC operations. Furthermore,
recall that if the completed sum (final psum) can be computed within a single pass (i.e., C × H × W < B), then the precision
of the ADC can be reduced to the precision of the output activation. However, when multiple passes are needed, the ADC
needs greater precision because the results of each pass need to be added together to form the completed sum; otherwise, there
may be an accuracy loss.
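The utilization argument for the toy example can be checked with a few lines of Python. This is a sketch under the same simplification as Figure 10.11b (one storage element per weight, a single pass, and a workload no larger than the array); the function name and dictionary keys are hypothetical.

```python
def array_utilization(chw, m, rows_b, cols_a):
    """Idle rows, idle columns, and utilization of a B-row by A-column array."""
    used_rows = min(chw, rows_b)      # rows filled by the C x H x W weights of a filter
    used_cols = min(m, cols_a)        # columns filled by the M output channels
    return {
        "idle_rows": rows_b - used_rows,
        "idle_cols": cols_a - used_cols,
        "parallel_macs": used_rows * used_cols,
        "utilization": (used_rows * used_cols) / (rows_b * cols_a),
    }


# Toy example of Figure 10.11b: CHW = M = 4 mapped onto an 8 x 8 array.
toy = array_utilization(chw=4, m=4, rows_b=8, cols_a=8)
print(toy)   # 4 idle rows, 4 idle columns, 16 parallel MACs, utilization 0.25
```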


As a result, typical fabricated array sizes range from 16b × 64 [353] to 512b × 512 [352]
for SRAM and from 128 × 128 to 256 × 256 [342] for NVM. This limitation in array size
affects throughput, area density, and energy efficiency. Multiple arrays can be used to scale up
the design in order to fit the entire DNN model and increase throughput [329, 347]. However,
the impact on amortizing the peripheral cost is minimal. Furthermore, an additional NoC must
be introduced between the arrays. Accordingly, the limitations on energy efficiency and area
density remain.
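The bit line limits behind these array sizes can be explored numerically. The sketch below evaluates Eq. (10.1) directly and, as an assumption, treats Eq. (10.2) as the difference between two resistive dividers of the form V_in · R_BL / (R + R_BL), evaluated at R_ON and R_OFF. The 1.2 fF local capacitance is the value quoted from [352]; every other number (supply voltage, bit line capacitances, device and wire resistances) is a hypothetical placeholder rather than measured data.

```python
def delta_v_charge_sharing(v_dd, v_local, c_local, c_bl):
    """Eq. (10.1): bit line swing after charge sharing with the local capacitor."""
    return (v_dd - v_local) * c_local / (c_local + c_bl)


def delta_v_current_summing(v_in, r_on, r_off, r_bl):
    """Eq. (10.2), assumed here to be the difference of two voltage dividers."""
    v_high = v_in * r_bl / (r_on + r_bl)      # device in its low-resistance state
    v_low = v_in * r_bl / (r_off + r_bl)      # device in its high-resistance state
    return v_high - v_low


C_LOCAL = 1.2e-15                             # 1.2 fF, from the text [352]
for c_bl in (10e-15, 100e-15, 1000e-15):      # hypothetical bit line capacitances
    dv = delta_v_charge_sharing(v_dd=1.0, v_local=0.0, c_local=C_LOCAL, c_bl=c_bl)
    print(f"C_BL = {c_bl * 1e15:6.0f} fF -> dV_BL = {dv * 1e3:7.2f} mV")

for r_bl in (1e6, 1e7):                       # hypothetical bit line resistances
    dv = delta_v_current_summing(v_in=1.0, r_on=1e4, r_off=1e5, r_bl=r_bl)
    print(f"R_BL = {r_bl:.0e} ohm -> dV_BL = {dv * 1e3:7.2f} mV")
```

In both loops the swing shrinks as the bit line parasitic grows, so a fixed minimum sensing margin translates into a maximum bit line length.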


Number of Rows Activated in Parallel 

Ideally, it would be desirable to use all rows (B) at once to maximize parallelism for high
bandwidth and high throughput. In practice, the number of rows that can be used at once is limited
by several factors.


1. The number of bits in the ADC, since more rows means more bits are required to resolve
the accumulation (i.e., the partial sums will have more bits). Some works propose using
fewer bits for the ADC than the maximum required [361, 362]; however, this can reduce the
accuracy.¹⁵


2. The cumulative effect of the device variations can decrease the accuracy. 


3. The maximum voltage drop or accumulated current that can be tolerated by the bit line.¹⁶
This can be particularly challenging for advanced process technologies (e.g., 7 nm and
below) due to the increase in bit line resistance and increased susceptibility to
electromigration issues, which limits the maximum current on the bit line.


As a result, the typical number of rows activated in parallel is 64 [342] or below [344].
A digital accumulator can be used after each ADC to accumulate across all B rows in B/64
cycles [342]; however, this reduces throughput and increases energy due to multiple ADC
conversions. To reduce the additional ADC conversions, recent work has explored performing the
accumulation in the analog domain [348]. Figure 10.12 shows how this applies to our toy
example. Design 10.23 shows the corresponding loop nest and illustrates the multiple cycles it takes
to perform all the MACs.
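The row-limited schedule can be sketched in Python as follows. Two assumptions are made for illustration: rows are activated in contiguous groups, and the weights are hypothetical values chosen so that the per-cycle psums match those quoted in the Figure 10.12 caption. The helper `adc_bits_required` gives only the worst-case bit count for dense, unsigned operands; it does not model sparsity or the reduced-precision ADCs proposed in [361, 362].

```python
def adc_bits_required(rows_active, max_input, max_weight):
    """Worst-case bits needed to resolve an unsigned sum accumulated on one bit line."""
    max_sum = rows_active * max_input * max_weight
    return max(1, max_sum.bit_length())


def row_limited_matvec(i, f, rows_per_cycle):
    """Accumulate a matrix-vector product over groups of rows, one group per cycle."""
    chw, m = len(f), len(f[0])
    o = [0] * m
    per_cycle = []
    for start in range(0, chw, rows_per_cycle):            # temporal loop over row groups
        rows = range(start, min(start + rows_per_cycle, chw))
        psum = [sum(i[r] * f[r][c] for r in rows) for c in range(m)]  # one ADC read per column
        per_cycle.append(psum)
        o = [acc + p for acc, p in zip(o, psum)]           # digital accumulator after the ADC
    return o, per_cycle


i = [1, 2, 0, 3]                                           # inputs from the Figure 10.14 caption
f = [[1, 0, 3, 2], [0, 2, 2, 0], [3, 1, 0, 1], [2, 3, 1, 1]]   # hypothetical weights
o, per_cycle = row_limited_matvec(i, f, rows_per_cycle=2)
print(per_cycle)                   # [[1, 4, 7, 2], [6, 9, 3, 3]]
print(o, len(per_cycle))           # [7, 13, 10, 5] 2
print(adc_bits_required(64, 1, 1)) # 7 bits for 64 rows of 1-bit inputs and 1-bit weights
```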


15 The number of bits required by the ADC depends on the number of values being accumulated on the bit line (i.e., number
of rows activated in parallel), whether the values are sparse [361] (i.e., zero values will not contribute to the accumulated sum),
and whether the accumulated sum is a partial sum or a fully accumulated sum (i.e., it only needs to go through a nonlinear
function to become an output activation). Using less than the maximum required ADC bits for the fully accumulated sum
has less impact on accuracy than on the partial sum, since the fully accumulated sum is typically quantized to the bit-width
of the input activation for the next layer, as discussed in Chapter 7. However, the ability to fully accumulate the sum on a bit
line depends on whether the number of rows in the array is large enough to hold all the weights for a given filter (i.e.,
B > C × H × W).

16 For instance, for a 6T SRAM bit cell, a large voltage drop on the bit line can cause the bit cell to flip (i.e., an unwanted
write operation on the bit cell); using an 8T bit cell can prevent this at the cost of increased area.
Figure 10.12: Example of limited number of rows activated in parallel. If the ADC has only 3
bits, only two rows can be used at a time. It would take two cycles (time steps) to complete the
computation. There are two columns for psum in the figure: (1) psum (current cycle) corresponds
to psum resulting from the dot product computed at the current cycle; (2) psum (accumulated)
corresponds to the accumulated value of the psums across cycles. At t = 1, the psum of [6, 9,
3, 3] is computed and added (e.g., with a digital adder) to the psum at t = 0 of [1, 4, 7, 2] to
achieve the final psum [7, 13, 10, 5], as shown in the figure.


Design 10.23 Toy matrix multiply loop nest with limited number of parallel active rows

i = Array(CHW)     # Input activations
f = Array(CHW, M)  # Filter weights
o = Array(M)       # Output partial sums

parallel-for m in [0, M):
    parallel-for chw1 in [0, CHW/2):
        for chw0 in [0, 2):
            chw = chw1*2 + chw0
            o[m] += i[chw] * f[chw, m]





Number of Columns Activated in Parallel 

Ideally, it would be desirable to use all columns (A) at once to maximize parallelism for high
bandwidth and high throughput. In practice, the number of columns that can be used is limited
by whether the area of the ADC can pitch-match the width of the column, which is required
for a compact area design; this can be challenging when using high-density storage elements
such as NVM devices. A common solution is to time multiplex the ADC across a set of eight
columns, which means that only A/8 columns are used in parallel [342]; however, this reduces
throughput. Figure 10.13 shows how this applies to our toy example, and Design 10.24 shows
the corresponding loop nest.


Figure 10.13: Example of limited number of columns activated in parallel. If the width of an
ADC is equal to two columns, then the columns need to be time multiplexed. It would take two
cycles to complete the computation. If we combined this with the previously described parallel
row limitations, it would take four cycles to complete the computation.


Design 10.24 Toy matrix multiply loop nest with limited number of parallel active columns

i = Array(CHW)     # Input activations
f = Array(CHW, M)  # Filter weights
o = Array(M)       # Output partial sums

parallel-for m1 in [0, M/2):
    parallel-for chw in [0, CHW):
        for m0 in [0, 2):
            m = m1*2 + m0
            o[m] += i[chw] * f[chw, m]
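A sequential Python sketch of the column-multiplexed schedule is given below, assuming that each ADC is shared by `cols_per_adc` adjacent columns and reads one of them per cycle. The function names and weight values are hypothetical. The helper `cycles_needed` combines this limit with the row limit discussed earlier and reproduces the four cycles mentioned in the Figure 10.13 caption.

```python
import math


def column_multiplexed_matvec(i, f, cols_per_adc):
    """Matrix-vector product where each ADC is time multiplexed across adjacent columns."""
    chw, m = len(f), len(f[0])
    o = [0] * m
    cycles = 0
    for m0 in range(cols_per_adc):                 # temporal: which column each ADC reads
        for m1 in range(0, m, cols_per_adc):       # parallel across ADCs
            col = m1 + m0
            if col < m:
                o[col] = sum(i[r] * f[r][col] for r in range(chw))
        cycles += 1
    return o, cycles


def cycles_needed(chw, rows_per_cycle, cols_per_adc):
    """Cycle count when the row limit and the column limit are both applied."""
    return math.ceil(chw / rows_per_cycle) * cols_per_adc


i = [1, 2, 0, 3]
f = [[1, 0, 3, 2], [0, 2, 2, 0], [3, 1, 0, 1], [2, 3, 1, 1]]   # hypothetical weights
o, cycles = column_multiplexed_matvec(i, f, cols_per_adc=2)
print(o, cycles)                                   # [7, 13, 10, 5] 2
print(cycles_needed(chw=4, rows_per_cycle=2, cols_per_adc=2))  # 4
```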
Figure 10.14: Example of performing pulse-width modulation of the input activations with a
1-bit DAC. It would take three cycles to complete the computation if all weights can be used at
once. Specifically, the input activations would be signaled across time as [1, 1, 0, 1] + [0, 1, 0, 1] +
[0, 0, 0, 1] = [1, 2, 0, 3], where the width of the pulse in time corresponds to the value of the
input. There are two columns for psum in the figure: (1) psum (current cycle) corresponds to
psum resulting from the dot product computed at the current cycle; (2) psum (accumulated)
corresponds to the accumulated value of the psums across cycles. Note that if we combined
the limitation illustrated in this figure with the previously described parallel row and column
limitations, it would take 12 cycles to complete the computation.


Time to Deliver Input 

Ideally, it would be desirable for all bits in the input activations to be encoded onto the word line 
in the minimum amount of time to maximize throughput; a typical approach is to use voltage 
amplitude modulation [351]. In practice, this can be challenging for two reasons.


1. The nonlinearity of the devices makes encoding the input value using voltage amplitude
modulation difficult.


2. The complexity of the DAC that drives the word line scales with the number of bits.
As a result, the input activations are often encoded in time (e.g., pulse-width modulation
[354, 355] or number of pulses [362]¹⁷), with a fixed voltage (the DAC is only 1 bit), where
the partial sum is determined by accumulating charge over time; however, this reduces throughput.¹⁸
Figure 10.14 shows how this applies to our toy example. One approach to reduce the
complexity of the DAC or the current accumulation time is to reduce the precision of the input
activations, as discussed in Chapter 7; however, this will also reduce accuracy.
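The time encoding in the Figure 10.14 caption can be reproduced with a short sketch. It assumes unsigned activations and a pulse that stays high for as many cycles as the activation value, so an N-bit activation needs 2^N − 1 cycles; the function names and weight values are hypothetical.

```python
def pwm_encode(inputs, bits):
    """Pulse trains for unsigned inputs: input x is high during the first x cycles."""
    n_cycles = (1 << bits) - 1
    return [[1 if t < x else 0 for x in inputs] for t in range(n_cycles)]


def pwm_matvec(inputs, f, bits):
    """Accumulate the per-cycle psums produced by 1-bit word line pulses."""
    m = len(f[0])
    acc = [0] * m
    for pulse in pwm_encode(inputs, bits):
        psum = [sum(p * f[r][c] for r, p in enumerate(pulse)) for c in range(m)]
        acc = [a + s for a, s in zip(acc, psum)]
    return acc


inputs = [1, 2, 0, 3]                                          # from the Figure 10.14 caption
f = [[1, 0, 3, 2], [0, 2, 2, 0], [3, 1, 0, 1], [2, 3, 1, 1]]   # hypothetical weights
pulses = pwm_encode(inputs, bits=2)
print(pulses)                      # [[1, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
print(pwm_matvec(inputs, f, 2))    # [7, 13, 10, 5]
print(2 * 2 * len(pulses))         # 12 cycles with the two-row and two-column limits
```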


Time to Compute a MAC 

Ideally, it would be desirable for a MAC to be computed in a single cycle. In practice, the 
storage element (bit cell or device) typically can only perform one-bit operations (e.g., XNOR 
and AND), and thus multiple cycles are required to build up to a multi-bit operation (e.g., full 
adder and multiplication) [331]. Figure 10.15 shows how this applies to our toy example. This
also requires additional logic after the ADC to combine the one-bit operations into a multi-bit
operation. Both the extra cycles and the extra logic reduce throughput, energy efficiency, and density.
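How one-bit operations build up to a multi-bit MAC can be sketched as a bit-serial dot product: each step ANDs one bit plane of the inputs with one bit plane of the weights, counts the ones (the quantity a bit line would accumulate), and a shift-and-add stage combines the counts. This sketch serializes both operands in binary, which is an assumption made for brevity and differs from the pulse-width input delivery shown in Figure 10.15; the function names are hypothetical.

```python
def bit(x, k):
    """Return bit k of the unsigned integer x."""
    return (x >> k) & 1


def bitwise_dot(inputs, weights, in_bits, w_bits):
    """Unsigned dot product built from 1-bit ANDs followed by shift-and-add."""
    total = 0
    one_bit_steps = 0
    for a in range(in_bits):                      # input bit plane
        for b in range(w_bits):                   # weight bit plane
            count = sum(bit(x, a) & bit(w, b) for x, w in zip(inputs, weights))
            total += count << (a + b)             # shift-and-add logic after the ADC
            one_bit_steps += 1
    return total, one_bit_steps


inputs = [1, 2, 0, 3]
column = [1, 0, 3, 2]                             # hypothetical 2-bit weights of one column
print(bitwise_dot(inputs, column, in_bits=2, w_bits=2))   # (7, 4)
```

The second element of the result counts the one-bit steps; it grows with the product of the operand bit-widths, which is the throughput cost described above.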


10.3 PROCESSING IN SENSOR 


In certain applications, such as image processing, the data movement from the sensor itself 
can account for a significant portion of the overall energy consumption. Accordingly, there has 
been work on bringing the processing near or into the sensor, which is similar to the work on 
bringing the processing near or into memory discussed in the previous sections. In both cases, 
the goal is to reduce the amount of data read out of the memory/sensor and thus the number of 
ADC conversions, which can be expensive. Both cases also require moving the computation into 
the analog domain and consequently suffer from increased sensitivity to circuit non-idealities. 
While processing near memory and processing in memory focus on reducing data movement 
of the weights of the DNN model, processing near sensor and processing in sensor focus on 
reducing the data movement of the inputs to the DNN model. 

Processing near sensor has been demonstrated for image processing applications, where
computation can be performed in the analog domain before the ADC in the periphery of the
image sensor. For instance, Zhang et al. [363] and Lee et al. [364] use switched capacitors to
perform 4-bit multiplications and 3-bit by 6-bit MAC operations, respectively. RedEye [365] 
proposes performing the entire convolution layer (including convolution, max pooling and quan- 
tization) in the analog domain before the ADC. It should be noted that the results in [365] are 
based on simulations, while [363, 364] report measurements from fabricated test chips. 

It is also feasible to embed the computation not just before the ADC, but directly into the
sensor itself (i.e., processing in sensor). For instance, in [366] an Angle Sensitive Pixels sensor
is used to compute the gradient of the input, which, along with compression, reduces the data
movement from the sensor by 10×. In addition, since the first layer of the DNN often outputs
a gradient-like feature map, it may be possible to skip the computations in the first layer, which
further reduces energy consumption, as discussed in [367, 368].


17 Using pulses increases robustness to nonlinearity at the cost of increased switching activity.
18 Alternatively, a single pulse can be used for the input activations if the weights are replicated across multiple rows (e.g.,
$2^N - 1$ rows for an N-bit activation) [333]. This is a trade-off between time and area.


Figure 10.15: Example of time to compute MAC operation if each storage element can only
perform one-bit operations. It takes three cycles to deliver the input (similar to Figure 10.14).
There are two columns for psum in the figure: (1) psum (current cycle) corresponds to psum
resulting from the dot product computed at the current cycle; (2) psum (accumulated) corresponds
to the accumulated value of the psums across cycles. In addition, extra cycles are required at the
end to combine accumulated bits from each bit line to form the final output sum. The number
of cycles required to perform the shift and add would depend on the number of bit lines divided
by the number of sets of shift-and-add logic.


10.4 PROCESSING IN THE OPTICAL DOMAIN 


Processing in the optical domain is an area of research that is currently being explored as an 
alternative to all-electronic accelerators [369]. It is motivated, in part, by the fact that photons 
travel much faster than electrons, and the cost of moving a photon can be independent of dis- 
tance. Furthermore, multiplication can be performed passively (for example with optical inter- 
ference [370, 371], with reconfigurable filters [373], or static phase masks [374]) and detection 
can occur at over 100 GHz. Thus, processing in the optical domain may provide significant 
improvements in energy efficiency and throughput over the electrical domain. 
Figure 10.16: Use ASP in front end to perform processing of first layer. (Figure from [367].)


Much of the recent work in optical computing has focused on performing matrix
multiplication, which can be used for DNN processing; these works are often referred to as photonic
accelerators or optical neural networks. For instance, Shen et al. [370] present a programmable 
nanophotonic processor where the input activations are encoded in the amplitudes of optical 
pulses (light) that travel through an array of on-chip interferometers (composed of beamsplit- 
ters) that represent the weight matrix, where the weights determine the amount of light that 
is passed to the output. This is effectively a weight-stationary dataflow. The accumulation is 
performed based on the accumulated light from various waveguides at the photodetector. 

Alternatively, Hamerly et al. [371], shown in Figure 10.17b, demonstrate matrix mul- 
tiplication based on coherent detection, where both the weights and activations are encoded 
on-the-fly into light pulses, and are interfered in free-space on a beamsplitter to perform mul- 
tiplication. Since, in this scheme, there is no need for on-chip interferometers (which have a 
large footprint), this approach may be more scalable, at the cost of added complexity in align- 
ment. This is effectively an output-stationary dataflow, where the output is accumulated on the 
photodetector as an analog electronic signal. 

Figure 10.17: Optical neural networks.


There is negligible power loss in the computation when processing in the optical domain.
Most of the power dissipation occurs when converting between electrical and optical domains,
specifically, in the converter to generate the light and the detector to collect the photons.
Therefore, similar to the processing in memory work, the larger the array (or in this case the matrix),
the more these conversion costs can be amortized.
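The amortization argument can be made concrete with a simple counting sketch. It assumes a weight-stationary matrix-vector multiply in which each input activation needs one electrical-to-optical conversion and each output needs one detection, while the weights are already held in the optical elements; schemes that also encode the weights on the fly would add further conversions. The function name is hypothetical, and no energy values are implied.

```python
def conversions_per_mac(n_in, n_out):
    """Domain conversions per MAC for an n_in-by-n_out matrix-vector multiply."""
    macs = n_in * n_out                # one MAC per matrix element
    conversions = n_in + n_out         # one per input activation, one per output
    return conversions / macs


for n in (4, 64, 1024):
    print(n, conversions_per_mac(n, n))
# 4 0.5
# 64 0.03125
# 1024 0.001953125
```

For a square N × N matrix the ratio is 2/N, so the conversion overhead per MAC falls in inverse proportion to the matrix dimension under these assumptions.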

Note, however, that while computing in the optical domain may be energy efficient, the 
non-idealities in the optical devices (e.g., crosstalk between detectors, errors in phase encoding, 
photodetection noise) can lead to a reduction in accuracy. To address this accuracy loss, Bernstein 
et al. [372] propose a hybrid electronic-optics approach where the data transfer is done in the 
optical domain to exploit the distance-independent cost of photons, while the computation itself 
(i.e., MAC operation) is performed digitally in the electrical domain to avoid the non-idealities 
of the optical devices. 

Recent works on optical neural networks have reported results based on simulations [371] 
or simulations based on data that has been extrapolated from experimental results [370]. These 
works demonstrate functionality on simple DNN models for digit classification and vowel recog- 
nition. 
CHAPTER 11


Conclusion 


The use of deep neural networks (DNNs) has recently seen explosive growth. They are currently 
widely used for many artificial intelligence (AI) applications including computer vision, speech 
recognition, and robotics and are often delivering better than human accuracy. However, while 
DNNs can deliver this outstanding accuracy, it comes at the cost of high computational com- 
plexity. With the stagnation of improvements in general-purpose computation [11], there is a 
movement toward more domain-specific hardware, and in particular for DNN processing. Con- 
sequently, techniques that enable efficient processing of DNNs to improve energy-efficiency and 
throughput without sacrificing accuracy with cost-effective hardware are critical to expanding the 
deployment of DNNs in both existing and new domains. 

Creating a system for efficient DNN processing should begin with understanding the
current and future applications, the specific computations they require today, and how
those computations may evolve. Therefore, this book surveyed a number of the current
applications, focusing on computer vision applications, the associated algorithms, and the data
being used to drive the algorithms. These applications, algorithms, and input data are
experiencing rapid change, so extrapolating these trends to determine the degree of flexibility desired
to handle next-generation computations becomes an important ingredient of any design project.

During the design-space exploration process, it is critical to understand and balance the 
important system metrics. For DNN computation these include the accuracy, energy, through- 
put and hardware cost. Evaluating these metrics is, of course, key, so this book surveyed the 
important components of a DNN workload. Specifically, a DNN workload has two major
components. First, the workload consists of the “network architecture” of the DNN model including
the “shape” of each layer and the interconnections between layers. These can vary both within 
and between applications. Second, the workload consists of the specific data input to the DNN. 
This data will vary with the input set used for training or the data input during operation for 
inference. 

This book also surveyed a number of avenues that prior work has taken to optimize
DNN processing. Since data movement dominates energy consumption, a primary focus of 
some recent research has been to reduce data movement while maintaining accuracy, throughput, 
and cost. This means selecting architectures with favorable memory hierarchies like a spatial 
array, and developing dataflows that increase data reuse at the low-cost levels of the memory 
hierarchy. We have included a taxonomy of dataflows and an analysis of their characteristics. 
Understanding the throughput and energy efficiency of a DNN accelerator depends upon how 
each DNN workload maps to the hardware. Therefore, we discussed the process of optimally
mapping workloads to the accelerator and the associated throughput and energy models. 

The DNN domain also affords an excellent opportunity for hardware/algorithm co- 
design. Many works have aimed to save storage space and energy by changing the representation 
of data values in the DNN. We distill and present the key concepts from these approaches. Still 
other work saves energy and sometimes increases throughput by increasing and then exploiting 
sparsity of weights and/or activations. We presented a new abstract data representation that en- 
ables a systematic presentation of designs focused on exploiting sparsity. Co-design needs to be 
aware of the impact on accuracy. Therefore, to avoid losing accuracy it is often useful to mod- 
ify the network or fine-tune the network’s weights to accommodate these changes. Thus, this 
book both reviewed a variety of these techniques and discussed the frameworks that are available 
for describing, running and training networks. 

Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced 
technologies to improve efficiency. These include using memristors for analog computation and 
3-D stacked memory. Advanced technologies can also facilitate moving computation closer to 
the source by embedding computation near or within the sensor and the memories. Of course, all 
of these techniques should also be considered in combination, while being careful to understand 
their interactions and looking for opportunities for joint hardware/algorithm co-optimization. 

In conclusion, although much work has been done, DNNs remain an important area of 
research with many promising applications and opportunities for innovation at various levels of 
hardware design. We hope this book provides a structured way of navigating the complex space
of DNN accelerator designs that will inspire and lead to new advances in the field.